[ 
https://issues.apache.org/jira/browse/SOLR-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SOLR-17575:
----------------------------------
    Labels: pull-request-available  (was: )

> Solr Langid backwards compatibility with the legacy "langid.whitelist" is 
> broken
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-17575
>                 URL: https://issues.apache.org/jira/browse/SOLR-17575
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - LangId
>    Affects Versions: 9.1, 9.2, 9.3, 9.4, 9.5, 9.6
>            Reporter: Alexander Zagniotov
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I’m seeking your feedback regarding an issue I’ve encountered when 
> configuring the Solr Langid module, specifically when using the deprecated 
> {{langid.whitelist}} property instead of Solr’s newer {{langid.allowlist}} 
> property to define allowed language codes.
> As you are likely aware, the {{langid.whitelist}} property has been 
> deprecated since Solr 9.0.0, and the recommended approach is to use 
> {{langid.allowlist}} instead. I am indeed using the {{langid.allowlist}} 
> property, but I would like to bring attention to an issue I’ve observed with 
> the legacy support for {{langid.whitelist}}. I believe there is a bug in 
> the backward compatibility code that could cause unintended behavior when the 
> {{langid.whitelist}} property is configured.
> To illustrate the problem, I’ll provide a detailed example based on the code:
>  # *The check for {{legacyAllowList}}*: In the Solr code, specifically in 
> [https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127],
>  there is a check for the length of the {{legacyAllowList}} string. However, 
> {{legacyAllowList}} is never used after that length check. Instead, an 
> empty string ({{""}}) is used as the default value when fetching the 
> {{LANG_ALLOWLIST}} parameter.
>  # *Resulting issue with the {{langAllowlist}} set*: As a result, the 
> {{Set<String> langAllowlist}} is populated with a single element: an empty 
> string ({{""}}). This causes a problem when the code later checks whether 
> {{langAllowlist}} is empty 
> ([https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405]).
>  The check {{langAllowlist.isEmpty()}} returns {{false}}, because the set 
> does contain one element, the empty string.
>  # *Unexpected fallback behavior*: Consequently, even when the language of 
> the document is correctly detected (for instance, German), the flow 
> incorrectly enters the "else" clause. This results in the log message 
> _"Detected a language not in allowlist (de), using fallback en"_, and the 
> document language is set to the fallback, English ({{en}}), instead of the 
> detected German.
> I believe this behavior stems from a bug in the backwards compatibility 
> handling for the deprecated {{langid.whitelist}} property. If the 
> {{legacyAllowList}} value is not being properly used or passed to the 
> {{langAllowlist}} set, it leads to incorrect fallback behavior.
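
The three steps above can be reproduced outside Solr with a small sketch. This is not the Solr source: the class name, the method names, and the use of a plain Map in place of Solr's parameter object are assumptions made for illustration, and the "as reported" method only mirrors the behavior described in the report. The second method shows one possible way the legacy value could be carried into the set.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the allowlist parsing described in the report.
public class LangidAllowlistSketch {
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  // Mirrors the reported behavior: the legacy value only takes part in the
  // length check, while the value that is split defaults to "".
  public static Set<String> parseAsReported(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    if (params.getOrDefault(LANG_ALLOWLIST, legacyAllowList).length() > 0) {
      // "".split(",") yields a one-element array holding the empty string.
      for (String lang : params.getOrDefault(LANG_ALLOWLIST, "").split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // One possible correction (an assumption, not the committed fix): resolve
  // the value once, falling back to the legacy value, and split that.
  public static Set<String> parseWithLegacyFallback(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    String value = params.getOrDefault(LANG_ALLOWLIST, legacyAllowList);
    if (value.length() > 0) {
      for (String lang : value.split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // Simplified form of the later check: keep the detected language when the
  // allowlist is empty or contains it, otherwise use the fallback.
  public static String resolve(Set<String> allowlist, String detected, String fallback) {
    if (allowlist.isEmpty() || allowlist.contains(detected)) {
      return detected;
    }
    return fallback;
  }

  public static void main(String[] args) {
    Map<String, String> legacyOnly = Map.of(LANG_WHITELIST, "de,en");
    System.out.println(resolve(parseAsReported(legacyOnly), "de", "en"));
    System.out.println(resolve(parseWithLegacyFallback(legacyOnly), "de", "en"));
  }
}
```

With only {{langid.whitelist}} configured, the first method returns a set holding just the empty string, so a document detected as German is resolved to the fallback even though {{de}} was listed.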



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
