[
https://issues.apache.org/jira/browse/SOLR-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SOLR-17575:
----------------------------------
Labels: pull-request-available (was: )
> Solr Langid backwards compatibility with the legacy "langid.whitelist" is
> broken
> --------------------------------------------------------------------------------
>
> Key: SOLR-17575
> URL: https://issues.apache.org/jira/browse/SOLR-17575
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: contrib - LangId
> Affects Versions: 9.1, 9.2, 9.3, 9.4, 9.5, 9.6
> Reporter: Alexander Zagniotov
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I’m seeking your feedback regarding an issue I’ve encountered when
> configuring the Solr Langid module, specifically when using the deprecated
> {{langid.whitelist}} property instead of Solr’s newer {{langid.allowlist}}
> property to define allowed language codes.
> As you are likely aware, the {{langid.whitelist}} property has been
> deprecated since Solr 9.0.0, and the recommended approach is to use
> {{langid.allowlist}} instead. I am indeed using the {{langid.allowlist}}
> property, but I would like to bring attention to an issue I’ve observed with
> the legacy support for {{langid.whitelist}}. I believe there is a bug in
> the backward compatibility code that could cause unintended behavior when the
> {{langid.whitelist}} property is configured.
> To illustrate the problem, I’ll provide a detailed example based on the code:
> # *The check for {{legacyAllowList}}*: In the Solr code, specifically at
> [https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127],
> there is a check for the length of the {{legacyAllowList}} string. However,
> {{legacyAllowList}} is never actually used after that length check.
> Instead, an empty string ({{""}}) is used as the default value when
> fetching the {{LANG_ALLOWLIST}} parameter.
> # *Resulting issue with the {{langAllowlist}} set*: As a result, the
> {{Set<String> langAllowlist}} is populated with a single element: an empty
> string ({{""}}). This causes a problem when the code later checks whether
> {{langAllowlist}} is empty
> ([https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405]).
> The check {{langAllowlist.isEmpty()}} returns {{false}}, contrary to the
> intent, because the set does contain one element: the empty string.
> # *Unexpected fallback behavior*: Consequently, even though the language
> of the document might be correctly detected (for instance, if the document is
> identified as being in German), the flow incorrectly enters the "else"
> clause. This results in the log message: _"Detected a language not in
> allowlist (de), using fallback en"_ and the fallback language is set to
> English ({{en}}), even though the document language was correctly
> identified as German.
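
The three steps above can be reproduced outside Solr with a small standalone
sketch. It is not the Solr source: {{parseAsReported}} and {{resolve}} are
hypothetical helpers that only mirror the behavior described in the report,
and a plain {{Map}} stands in for the Solr parameters. The key fact it relies
on is that {{"".split(",")}} in Java returns a one-element array containing
the empty string.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LegacyAllowlistRepro {
  // Parameter names as given in the report.
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  // Hypothetical stand-in for the parsing described in step 1.
  static Set<String> parseAsReported(Map<String, String> params) {
    Set<String> langAllowlist = new HashSet<>();
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    // The legacy value takes part in the length check...
    if (params.getOrDefault(LANG_ALLOWLIST, legacyAllowList).length() > 0) {
      // ...but the value that is split defaults to "" instead of the legacy value.
      for (String lang : params.getOrDefault(LANG_ALLOWLIST, "").split(",")) {
        langAllowlist.add(lang);
      }
    }
    return langAllowlist;
  }

  // Hypothetical stand-in for the later allowlist check (steps 2 and 3).
  static String resolve(String detected, Set<String> allowlist, String fallback) {
    if (allowlist.isEmpty() || allowlist.contains(detected)) {
      return detected;
    }
    return fallback; // the "else" clause from step 3
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(LANG_WHITELIST, "de,en"); // only the deprecated property is set

    Set<String> allowlist = parseAsReported(params);
    System.out.println("size=" + allowlist.size()
        + " containsEmpty=" + allowlist.contains(""));
    // prints: size=1 containsEmpty=true

    System.out.println("resolved=" + resolve("de", allowlist, "en"));
    // prints: resolved=en
  }
}
```

With only {{langid.whitelist}} configured, German is on the intended list yet
the sketch still resolves to the fallback, which matches the log message
quoted in step 3.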
> I believe this behavior stems from a bug in the backwards compatibility
> handling for the deprecated {{langid.whitelist}} property. If the
> {{legacyAllowList}} value is not being properly used or passed to the
> {{langAllowlist}} set, it leads to incorrect fallback behavior.
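
One possible direction, shown only as a sketch and not as the change proposed
in the linked pull request: let the legacy value serve as the default when the
list is actually split, and skip blank entries so an empty string can never
end up in the set. The {{parseAllowlist}} helper and the use of a plain
{{Map}} for the parameters are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class LegacyAllowlistFallback {
  // Parameter names as given in the report.
  static final String LANG_WHITELIST = "langid.whitelist";
  static final String LANG_ALLOWLIST = "langid.allowlist";

  // Hypothetical helper: prefer langid.allowlist, fall back to langid.whitelist.
  static Set<String> parseAllowlist(Map<String, String> params) {
    String legacyAllowList = params.getOrDefault(LANG_WHITELIST, "");
    // The legacy value is the default for the string that gets split.
    String raw = params.getOrDefault(LANG_ALLOWLIST, legacyAllowList);

    Set<String> langAllowlist = new LinkedHashSet<>();
    for (String lang : raw.split(",")) {
      String trimmed = lang.trim();
      if (!trimmed.isEmpty()) { // never add "" to the set
        langAllowlist.add(trimmed);
      }
    }
    return langAllowlist;
  }

  public static void main(String[] args) {
    // Only the deprecated property is configured.
    System.out.println(parseAllowlist(Map.of(LANG_WHITELIST, "de, en")));
    // prints: [de, en]

    // Neither property is configured: the set stays empty.
    System.out.println(parseAllowlist(Map.of()));
    // prints: []
  }
}
```

Filtering blank entries is a defensive choice on top of the fallback: it keeps
{{isEmpty()}} meaningful even if a configured value is empty or contains
stray commas.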