William Martin created SOLR-10128:
-------------------------------------
Summary: langid.map.individual set to "true" is ignored
Key: SOLR-10128
URL: https://issues.apache.org/jira/browse/SOLR-10128
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Environment: Solr 6.0.4+
Reporter: William Martin
Priority: Minor
The
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
does not respect the "langid.map.individual" parameter
in solrconfig.xml. The documentation for langid.map.individual specifies:
{quote}
If you require detecting languages separately for each field, supply
langid.map.individual=true. The supplied fields will then be renamed according
to detected language on an individual field basis.
{quote}
However, when this parameter is set to "true", the fields are still mapped to the
language code of the entire document. For example, with the following snippet:
{code:xml|title=solrconfig.xml}
<processor
class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,text</str>
<str name="langid.langField">language_s</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.individual">true</bool>
</lst>
</processor>
{code}
a document that takes the form
{code:javascript}
{
"title": "This is an English title",
"text": "Pero el texto de este documento está en español."
}
{code}
will be turned into
{code:javascript}
{
"title_es": "This is an English title",
"text_es": "Pero el texto de este documento está en español.",
"language_s": ["es"]
}
{code}
rather than
{code:javascript}
{
"title_en": "This is an English title",
"text_es": "Pero el texto de este documento está en español.",
"language_s": ["es","en"]
}
{code}
during processing.
This bug seems to have been introduced in SOLR-3881 when the abstract method
{code:java|title=LangDetectLanguageIdentifierUpdateProcessor.java}
protected List<DetectedLanguage> detectLanguage(String content)
{code}
was changed to the signature
{code:java|title=LangDetectLanguageIdentifierUpdateProcessor.java}
protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)
{code}
which does not allow one to recognize individual fields while performing
language detection. As it stands, the entire document is analyzed once for each
individual field (those listed in the "langid.fl" or "langid.map.individual.fl"
parameters), and each field is mapped to the language of the entire document.
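To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It does not use the Solr API: the class IndividualFieldMapping, the methods mapIndividually and mapByDocument, and the toy detector toyDetect are all hypothetical names made up for illustration, and a plain Map stands in for SolrInputDocument. The language_s field is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class IndividualFieldMapping {

    // Hypothetical stand-in for a real detector: reports "es" when the text
    // contains the Spanish words "el" or "de", otherwise "en".
    static String toyDetect(String content) {
        String padded = " " + content.toLowerCase(Locale.ROOT) + " ";
        return (padded.contains(" el ") || padded.contains(" de ")) ? "es" : "en";
    }

    // Expected behaviour with langid.map.individual=true:
    // each listed field is detected and renamed on its own content.
    static Map<String, String> mapIndividually(Map<String, String> doc,
                                               List<String> fields,
                                               Function<String, String> detector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (fields.contains(e.getKey())) {
                String lang = detector.apply(e.getValue());
                out.put(e.getKey() + "_" + lang, e.getValue());
            } else {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Reported behaviour: the detector only ever sees the whole document,
    // so every listed field receives the document-level language.
    static Map<String, String> mapByDocument(Map<String, String> doc,
                                             List<String> fields,
                                             Function<String, String> detector) {
        StringBuilder all = new StringBuilder();
        for (String f : fields) {
            if (doc.containsKey(f)) {
                all.append(doc.get(f)).append(' ');
            }
        }
        String docLang = detector.apply(all.toString());
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String key = fields.contains(e.getKey()) ? e.getKey() + "_" + docLang : e.getKey();
            out.put(key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        // Abbreviated, ASCII-only version of the Spanish text in the example above.
        doc.put("text", "Pero el texto de este documento");
        List<String> fields = List.of("title", "text");

        System.out.println(mapIndividually(doc, fields, IndividualFieldMapping::toyDetect).keySet());
        System.out.println(mapByDocument(doc, fields, IndividualFieldMapping::toyDetect).keySet());
    }
}
```

In this sketch the detector passed to mapIndividually receives one field's content at a time, which is what the String-based signature made possible. A detector that only receives the whole document, as in mapByDocument, has no way to tell which field it is being asked about.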