Alexander Zagniotov created SOLR-17575:
------------------------------------------

             Summary: Solr Langid backwards compatibility with the legacy 
"langid.whitelist" is broken
                 Key: SOLR-17575
                 URL: https://issues.apache.org/jira/browse/SOLR-17575
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: contrib - LangId
    Affects Versions: 9.6, 9.5, 9.4, 9.3, 9.2, 9.1
            Reporter: Alexander Zagniotov


I’m seeking your feedback regarding an issue I’ve encountered when configuring 
the Solr Langid module, specifically when using the deprecated 
{{langid.whitelist}} property instead of Solr’s newer {{langid.allowlist}} 
property to define allowed language codes.

As you are likely aware, the {{langid.whitelist}} property has been deprecated 
since Solr 9.0.0, and the recommended approach is to use {{langid.allowlist}} 
instead. I am indeed using the {{langid.allowlist}} property, but I would like 
to bring attention to an issue I’ve observed with the legacy support for 
{{langid.whitelist}}. I believe there is a bug in the backward 
compatibility code that could cause unintended behavior when the 
{{langid.whitelist}} property is configured.

To illustrate the problem, I’ll provide a detailed example based on the code:
 # *The check for {{legacyAllowList}}*: In the Solr code, specifically at 
[https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127],
 there is a check for the length of the {{legacyAllowList}} string. However, 
{{legacyAllowList}} is never used after that length check. Instead, an empty 
string ({{""}}) is used as the default value when fetching the 
{{LANG_ALLOWLIST}} parameter.

 # *Resulting issue with the {{langAllowlist}} set*: As a result, the 
{{Set<String> langAllowlist}} is populated with a single element: an empty 
string ({{""}}). This causes a problem later, where the code checks whether 
{{langAllowlist}} is empty 
([https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405]).
 The check {{langAllowlist.isEmpty()}} returns {{false}}, which is not the 
intended outcome, because the set does contain one element: the empty string.

 # *Unexpected fallback behavior*: Consequently, even though the language 
of the document might be correctly detected (for instance, if the document is 
identified as being in German), the flow incorrectly enters the "else" clause. 
This results in the log message: _"Detected a language not in allowlist (de), 
using fallback en"_ and the fallback language is set to English ({{en}}), 
even though the document language was correctly identified as German.
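
The three steps above can be reproduced in isolation. The following is a minimal sketch, not Solr's actual code: the class {{LegacyAllowlistRepro}}, its methods, and the plain {{Map}} standing in for Solr's parameter object are all hypothetical, and the control flow is an assumption based on the description in this report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the processor's parameter handling; not Solr's code.
class LegacyAllowlistRepro {
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  // Assumed flow: the legacy value only takes part in the length check,
  // while "" is the default when the list itself is read.
  static Set<String> parseAllowlist(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    if (params.getOrDefault(LANG_ALLOWLIST, legacyAllowList).length() > 0) {
      // "".split(",") yields a one-element array holding the empty string.
      for (String lang : params.getOrDefault(LANG_ALLOWLIST, "").split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // Simplified version of the fallback decision described in step 3.
  static String resolve(String detected, Set<String> langAllowlist, String fallback) {
    if (langAllowlist.isEmpty() || langAllowlist.contains(detected)) {
      return detected;
    }
    return fallback;
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(LANG_WHITELIST, "de,en"); // only the deprecated property is set

    Set<String> allow = parseAllowlist(params);
    System.out.println(allow.size() + " " + allow.contains("")); // prints "1 true"
    System.out.println(resolve("de", allow, "en"));              // prints "en"
  }
}
```

In this sketch German is listed in the legacy property, yet the detected {{de}} is still replaced by the fallback, because the set holds only the empty string.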

I believe this behavior stems from a bug in the backwards compatibility 
handling for the deprecated {{langid.whitelist}} property. If the 
{{legacyAllowList}} value is not being properly used or passed to the 
{{langAllowlist}} set, it leads to incorrect fallback behavior.
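
One possible direction for a fix, offered only as a sketch under the same assumptions: use the legacy value as the default when reading the list, and skip blank entries. The class {{LegacyAllowlistFix}} and the {{Map}}-based parameters are hypothetical and do not reflect any actual Solr patch.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of one possible fix; not an actual Solr change.
class LegacyAllowlistFix {
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  static Set<String> parseAllowlist(Map<String, String> params) {
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    // The new property wins; the deprecated one is the fallback value.
    String effective = params.getOrDefault(LANG_ALLOWLIST, legacyAllowList);

    Set<String> langAllowlist = new LinkedHashSet<>();
    for (String lang : effective.split(",")) {
      String trimmed = lang.trim();
      if (!trimmed.isEmpty()) { // never store the empty string
        langAllowlist.add(trimmed);
      }
    }
    return langAllowlist;
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(LANG_WHITELIST, "de,en");
    System.out.println(parseAllowlist(params)); // prints "[de, en]"
  }
}
```

With no property configured the returned set is empty, so an {{isEmpty()}} check would behave as intended.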



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
