[jira] [Created] (SOLR-10128) langid.map.individual set to "true" is ignored

William Martin (JIRA) Sun, 12 Feb 2017 14:29:08 -0800

William Martin created SOLR-10128:
-------------------------------------

             Summary: langid.map.individual set to "true" is ignored
                 Key: SOLR-10128
                 URL: https://issues.apache.org/jira/browse/SOLR-10128
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
         Environment: Solr 6.0.4+
            Reporter: William Martin
            Priority: Minor



The 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor 
does not respect the "langid.map.individual" parameter in solrconfig.xml. 
The documentation for langid.map.individual specifies:
{quote}
If you require detecting languages separately for each field, supply 
langid.map.individual=true. The supplied fields will then be renamed according 
to detected language on an individual field basis.
{quote}
However, when this parameter is set to "true", the fields are still mapped to 
the language code of the entire document. For example, with the following snippet:
{code:xml|title=solrconfig.xml}
<processor 
class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,text</str>
     <str name="langid.langField">language_s</str>
     <bool name="langid.map">true</bool>
     <bool name="langid.map.individual">true</bool>
   </lst>
</processor>
{code}
a document that takes the form
{code:javascript}
{
  "title": "This is an English title",
  "text": "Pero el texto de este documento está en español."
}
{code}
will be turned into
{code:javascript}
{
  "title_es": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es"]
}
{code}
rather than
{code:javascript}
{
  "title_en": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es","en"]
}
{code}
during processing.

This bug seems to have been introduced in SOLR-3881 when the abstract method
{code:java|title=LangDetectLanguageIdentifierUpdateProcessor.java}
protected List<DetectedLanguage> detectLanguage(String content)
{code}
was changed to the signature
{code:java|title=LangDetectLanguageIdentifierUpdateProcessor.java}
protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)
{code}
which does not allow one to recognize individual fields while performing 
language detection. As it stands, the entire document is analyzed per 
individual field (included in the "langid.fl" or "langid.map.individual.fl" 
parameters) and the field is mapped to the language of the entire document.
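
To make the difference concrete, here is a minimal, self-contained sketch. It 
is not Solr code: the class PerFieldLangIdSketch, the methods mapByDocument and 
mapIndividually, and the word-matching toyDetect function are all hypothetical 
stand-ins chosen for illustration. It only shows why a detector that receives 
the whole document cannot produce per-field mappings, while one that receives 
each field's content can.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldLangIdSketch {

    // Hypothetical toy detector (an assumption, not a real language model):
    // returns "es" when at least two common Spanish words occur, else "en".
    public static String toyDetect(String content) {
        String padded = " " + content.toLowerCase(Locale.ROOT) + " ";
        String[] spanishHints = {" el ", " de ", " este ", " pero "};
        int hits = 0;
        for (String hint : spanishHints) {
            if (padded.contains(hint)) {
                hits++;
            }
        }
        return hits >= 2 ? "es" : "en";
    }

    // Behaviour described in this report: the language is detected once over
    // the concatenated document and applied to every field.
    public static Map<String, String> mapByDocument(
            Map<String, String> doc, Function<String, String> detector) {
        String docLang = detector.apply(String.join(" ", doc.values()));
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey() + "_" + docLang, e.getValue());
        }
        return out;
    }

    // Behaviour the documentation describes for langid.map.individual=true:
    // the language is detected from each field's own content.
    public static Map<String, String> mapIndividually(
            Map<String, String> doc, Function<String, String> detector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String fieldLang = detector.apply(e.getValue());
            out.put(e.getKey() + "_" + fieldLang, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        doc.put("text", "Pero el texto de este documento está en español.");

        System.out.println(mapByDocument(doc, PerFieldLangIdSketch::toyDetect));
        System.out.println(mapIndividually(doc, PerFieldLangIdSketch::toyDetect));
    }
}
```

With this toy detector, mapByDocument yields the keys title_es and text_es, 
while mapIndividually yields title_en and text_es. A real fix would 
presumably need detectLanguage to see per-field content again, but how best to 
do that inside the processor is left open here.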




