Hello,

While using Solr 6.0.4 I noticed that
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
does not respect the "langid.map.individual" parameter in solrconfig.xml.
The documentation for langid.map.individual
<https://wiki.apache.org/solr/LanguageDetection#langid.map.individual>
specifies:

> If you require detecting languages separately for each field, supply
> langid.map.individual=true. The supplied fields will then be renamed
> according to detected language on an individual field basis.

However, when this parameter is set to "true", the fields are still mapped
to the language code of the entire document. For example, with the following
snippet from solrconfig.xml

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <bool name="langid.map">true</bool>
    <bool name="langid.map.individual">true</bool>
  </lst>
</processor>

a document that takes the form

{
  "title": "This is an English title",
  "text": "Pero el texto de este documento está en español."
}

will be turned into

{
  "title_es": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es"]
}

rather than

{
  "title_en": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es","en"]
}

during processing.

This bug seems to have been introduced in SOLR-3881
<https://issues.apache.org/jira/browse/SOLR-3881> when the abstract method
(LangDetectLanguageIdentifierUpdateProcessor.java:52)

protected List<DetectedLanguage> detectLanguage(String content)

was changed to the signature

protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)

which does not allow the detector to distinguish individual fields while
performing language detection. As it stands, the entire document is analysed
once for each field listed in the "langid.fl" or "langid.map.individual.fl"
parameters, and each field is mapped to the language of the entire document.
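
To make the difference concrete, here is a minimal sketch in plain Java. It
does not use any Solr classes: the class name, the two mapping methods and
the stopword-counting "toyDetector" are all hypothetical stand-ins that I
made up for illustration, and the real processor works on SolrInputDocument
rather than a Map. It only contrasts detecting once over the whole document
with detecting once per field.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class IndividualMappingSketch {

    // Hypothetical toy detector: counts a handful of stopwords. It only
    // stands in for a real language detector and is not part of Solr.
    static String toyDetector(String text) {
        Set<String> spanish = Set.of("pero", "el", "de", "este", "en");
        Set<String> english = Set.of("this", "is", "an", "the", "of");
        int es = 0;
        int en = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (spanish.contains(token)) es++;
            if (english.contains(token)) en++;
        }
        return es > en ? "es" : "en";
    }

    // Behaviour I am seeing: one language is detected from all field values
    // together, and every field is renamed with that single language code.
    static Map<String, String> mapByDocument(Map<String, String> doc,
                                             Function<String, String> detector) {
        String lang = detector.apply(String.join(" ", doc.values()));
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey() + "_" + lang, e.getValue());
        }
        return out;
    }

    // Behaviour I would expect with langid.map.individual=true: each field
    // is detected on its own value and renamed with its own language code.
    static Map<String, String> mapIndividually(Map<String, String> doc,
                                               Function<String, String> detector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey() + "_" + detector.apply(e.getValue()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        doc.put("text", "Pero el texto de este documento está en español.");

        // Whole-document detection: both fields get the same suffix.
        System.out.println(mapByDocument(doc, IndividualMappingSketch::toyDetector).keySet());
        // Per-field detection: each field gets its own suffix.
        System.out.println(mapIndividually(doc, IndividualMappingSketch::toyDetector).keySet());
    }
}
```

With the example document above, the whole-document variant yields the keys
title_es and text_es, while the per-field variant yields title_en and
text_es, which is the mapping I expected from the documentation.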

I searched the Apache Jira for a ticket tracking this bug but did not find
anything that seemed related. I thought before filing a new ticket I would
ping this mailing list to see if anyone knows about work relating to this
issue or if there is already a ticket for it (not directly related to the
term "langid.map.individual" perhaps). If not I can go ahead and file the
ticket.


Thanks,

-William Martin
