[
https://issues.apache.org/jira/browse/LUCENE-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470558#comment-17470558
]
Robert Muir commented on LUCENE-10364:
--------------------------------------
> It was complaining about Character#getNumericValue(): This is a good hint,
> but in our case we were only using DECIMAL digits. For DecimalDigitFilter
> this is fine. Maybe rmuir should have a look at the unicode rules processing
> in GenerateUTR30DataFiles. Please don't see this as "Robert does not know
> Unicode", I just want to verify that the SuppressWarnings is fine, because I
> did not understand the code there. The problem is that
> UCharacter.getNumericValue() returns values outside 0..9 for roman numbers
> like 50. So adding it to the character '0' (0x30) to generate ASCII digit is
> not a good idea. DecimalDigitFilter does not do this, but for
> GenerateUTR30DataFiles I am unsure. So this should be verified!
I didn't write this file, but I may have "touched it last" :)
The code applies a UnicodeSet to filter the code points it works on:
* https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/GenerateUTR30DataFiles.java#L233-L234
* https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/data/utr30/NativeDigitFolding.txt#L33
You can see the set visually here:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%5B%3ANumeric_Type%3DDigit%3A%5D%5B%3ANd%3A%5D%5D+-+%5B%5B%3AChanges_When_NFKC_Casefolded%3DYes%3A%5D%5B%3ABlock%3DSuperscripts_And_Subscripts%3A%5D%5B%5Cu00B2%5Cu00B3%5Cu00B9%5D%5B%5Cu0030-%5Cu0039%5D%5D%5D&g=&i=
The key is the first part of the expression in the set:
{{[[:Numeric_Type=Digit:][:Nd:]]}}. This logic only operates on DIGITS. There
is nothing wrong with it.
So to me this check from error-prone is stupid and noisy, and should be
disabled if possible? (just like the rest of error-prone, sorry)
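To make the point about filtering concrete, here is a minimal sketch using only the JDK's java.lang.Character. It is not the code in GenerateUTR30DataFiles, which uses ICU's UnicodeSet. The class name DigitFoldingSketch and the helper foldToAsciiDigit are hypothetical, and the guard only approximates the [:Nd:] part of the set above (Numeric_Type=Digit is not modeled). It shows why adding the numeric value to '0' is only safe once the code point is known to be a decimal digit:

```java
public class DigitFoldingSketch {

  /**
   * Hypothetical helper: folds a decimal digit code point to its ASCII digit.
   * Returns -1 for anything that is not in general category Nd.
   */
  static int foldToAsciiDigit(int codePoint) {
    // Guard first: only decimal digits (Nd) are guaranteed a value in 0..9.
    if (Character.getType(codePoint) != Character.DECIMAL_DIGIT_NUMBER) {
      return -1;
    }
    return '0' + Character.getNumericValue(codePoint);
  }

  public static void main(String[] args) {
    int romanFifty = 0x216C;       // ROMAN NUMERAL FIFTY, category Nl
    int arabicIndicThree = 0x0663; // ARABIC-INDIC DIGIT THREE, category Nd

    // Without the guard, the roman numeral yields 50, and '0' + 50 is 'b'.
    System.out.println(Character.getNumericValue(romanFifty));                // 50
    System.out.println((char) ('0' + Character.getNumericValue(romanFifty))); // b

    // With the guard, the non-digit is rejected and the real digit folds to '3'.
    System.out.println(foldToAsciiDigit(romanFifty));              // -1
    System.out.println((char) foldToAsciiDigit(arabicIndicThree)); // 3
  }
}
```

The guard plays the role that the UnicodeSet filter plays in the generator: once the input is restricted to digits, the warning about values outside 0..9 no longer applies.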
> Prepare and update errorprone plugin for Java 17
> ------------------------------------------------
>
> Key: LUCENE-10364
> URL: https://issues.apache.org/jira/browse/LUCENE-10364
> Project: Lucene - Core
> Issue Type: Bug
> Components: general/build
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When working on LUCENE-10283 and also SOLR-15876, we figured out that
> errorprone is now also able to run with Java 17, if we update it and if it
> runs inside Gradle's JVM. This was caused by the add-opens we did for
> Spotless previously.
> There is only one case where it does not work: If you run spotless in a
> forked compiler, because the Gradle options are not applied then. The new
> Spotless plugin can handle this, but it won't work with our customized build
> for some reason. So I changed the if clause a bit, so it won't run errorprone
> if you use a JDK-18 preview build with RUNTIME_JAVA_HOME.
> When updating the rules it also found new bugs, some of them were real
> problems:
> - some tests were comparing Longs as Floats. The reason for this was that
> Suggesters changed to use Longs instead of Floats. In a similar way sometimes
> we assign a long to a float score. The first one was easy to fix by removing
> the epsilon from the assertEquals, the latter was mostly adding an explicit
> cast (to make it clear in our scorers)
> - There were also some concurrent modification exceptions possible, I fixed
> this in the tests by making a clone before modifying. For those using a TreeMap
> it was fine.
> - It was complaining about Character#getNumericValue(): This is a good hint,
> but in our case we were only using DECIMAL digits. For DecimalDigitFilter
> this is fine. Maybe [~rmuir] should have a look at the unicode rules
> processing in GenerateUTR30DataFiles. Please don't see this as "Robert does
> not know Unicode", I just want to verify that the SuppressWarnings is fine,
> because I did not understand the code there. The problem is that
> UCharacter.getNumericValue() returns values outside 0..9 for roman numbers
> like 50. So adding it to the character '0' (0x30) to generate ASCII digit is
> not a good idea. DecimalDigitFilter does not do this, but for
> GenerateUTR30DataFiles I am unsure. So this should be verified!
> - Some equals() methods were comparing primitives with Objects.equals(). This
> causes boxing and should be avoided (although Hotspot removes this after
> enough iterations)
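
As a rough illustration of two of the findings listed above (longs compared as floats, and Objects.equals() on primitives), here is a small sketch. It is not taken from the Lucene patch. The class PrimitiveComparisonSketch and its method names are hypothetical, chosen only to show the pattern:

```java
import java.util.Objects;

public class PrimitiveComparisonSketch {

  /** Compares two longs after narrowing to float, as an epsilon-style assert would. */
  static boolean equalAsFloats(long a, long b) {
    return (float) a == (float) b;
  }

  /** Compares two longs exactly. */
  static boolean equalAsLongs(long a, long b) {
    return a == b;
  }

  /** Same result as a == b for two longs, but both arguments are boxed to Long first. */
  static boolean equalBoxed(long a, long b) {
    return Objects.equals(a, b);
  }

  /** An explicit cast makes the lossy long-to-float conversion visible at the call site. */
  static float toScore(long weight) {
    return (float) weight;
  }

  public static void main(String[] args) {
    long a = 1L << 24; // 16777216, the last point where float still has integer precision
    long b = a + 1;    // 16777217, not representable as a float

    System.out.println(equalAsFloats(a, b)); // true: the difference is lost in the float
    System.out.println(equalAsLongs(a, b));  // false: the longs really differ
    System.out.println(equalBoxed(b, b));    // true, at the cost of boxing
    System.out.println(toScore(b) == toScore(a)); // true: scores collapse to the same float
  }
}
```

Comparing the longs directly avoids both the precision loss and the boxing.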