[ https://issues.apache.org/jira/browse/LUCENE-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989060#comment-16989060 ]
Erick Erickson edited comment on LUCENE-9080 at 12/5/19 6:50 PM:
-----------------------------------------------------------------
[~rcmuir] Thanks, that put me on a path to figure some things out. I'm still
baffled, just in a different way.
tl;dr;
- It looks like there are a bunch of hand-edits that are unimportant. They
should be fixed at the source though if possible.
- There are a couple of hand-edits that should be fixed in the input source
rather than the output.
-- See LUCENE-8683 and Nikolay Khitrin's comments/work for specific instances
of hand-edits to Java files that should be moved. [~dsmiley] [~sarowe] there
are a couple of JIRAs mentioned in LUCENE-8683; I may ask you to glance at
the ones you worked on and see if you recall anything about those changes.
- We should upgrade javacc to 6.0; I think we're getting deprecated methods
generated.
LONG FORM:
I tried this on branch_8x with Java 8, spoofing the bits that download nfc.txt,
nfkc.txt and nfkc_cf.txt so they use what's already checked out. If I ignore
all the obvious hand-edits, checksum differences and bogus imports, here's
what's still weird:
- HTMLCharEntites.jflex acquires an added pair of parens near the end: ('
| "zwj" | "zwnj"', ')')
Several binary files show differences, but I don't know whether that's just my
IDE not being able to deal with the charsets.
- org/apache/lucene/analysis/ja/dict/TokenInfoDictionary$fst.dat
- org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$fst.dat
- org/apache/lucene/analysis/icu/utr30.nrm is different
- 9 test binary files en-test-*.bin
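One way to take the IDE out of the picture for the binary files listed above is to compare them byte-for-byte via a digest. The following is only a sketch: the class name BinaryDiffCheck and the idea of passing a checked-in copy and a regenerated copy as arguments are assumptions for illustration, not part of the Lucene build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene: compares two files by SHA-256.
class BinaryDiffCheck {

  /** Returns the SHA-256 digest of the file's bytes as lowercase hex. */
  static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] digest = md.digest(Files.readAllBytes(file));
    StringBuilder hex = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** True if both files have the same digest, i.e. the same bytes. */
  static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
    return sha256(a).equals(sha256(b));
  }

  public static void main(String[] args) throws Exception {
    // args[0]: copy saved before regenerating, args[1]: regenerated copy (both assumed)
    Path before = Paths.get(args[0]);
    Path after = Paths.get(args[1]);
    System.out.println(sha256(before) + "  " + before);
    System.out.println(sha256(after) + "  " + after);
    System.out.println(sameContent(before, after) ? "identical" : "DIFFERENT");
  }
}
```

If the digests match, the difference shown in the IDE is a display artifact; if they differ, the regenerated file really changed.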
TestICUFoldingFilterFactory still fails; here's one reproduce line:
- ant test -Dtestcase=TestICUFoldingFilterFactory
-Dtests.method=testBogusArguments -Dtests.seed=311B6E926642DA19
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kn-IN
-Dtests.timezone=Europe/Tirane -Dtests.asserts=true -Dtests.file.encoding=UTF-8
with this error:
Caused by: java.io.IOException: ICU data file error: Header authentication
failed, please check if you have a valid ICU data file; data format 4e726d32,
format version 4.0.0.0
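The "data format" value in that message is a four-byte tag printed as hex, so it can be decoded to see what kind of file ICU thinks it is reading. A minimal sketch follows; the class name IcuDataFormat is made up for illustration.

```java
// Hypothetical helper: turns an 8-hex-digit ICU data-format tag into ASCII.
class IcuDataFormat {

  /** Decodes each pair of hex digits as one ASCII character. */
  static String decode(String hex) {
    if (hex.length() != 8) {
      throw new IllegalArgumentException("expected 8 hex digits, got: " + hex);
    }
    StringBuilder sb = new StringBuilder(4);
    for (int i = 0; i < 8; i += 2) {
      sb.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Tag taken from the error message above.
    System.out.println(decode("4e726d32")); // prints Nrm2
  }
}
```

The tag decodes to "Nrm2", which is consistent with a Normalizer2 data file such as the utr30.nrm that showed up as different. One possible reading, and it is only a guess, is that the regenerated file was written in format version 4 while the ICU4J on the test classpath accepts only an older version.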
I see jflex was upgraded, but I don't think regenerate was run after that.
This is virtually identical to what I get when trying this on master, pulling
down new nfkc*.txt files, though I can't guarantee the binary files are
identical. There are a few other differences like:
{code}
exptokseq[i] = jj_expentries.get(i); (old)
vs.
exptokseq[i] = (int[])jj_expentries.get(i); (new, hand edit I think?)
{code}
Also, these files aren't present in master at all; they're "Untracked"
according to Git:
lucene/core/src/java/org/apache/lucene/util/packed/Direct*.java
lucene/core/src/java/org/apache/lucene/util/packed/Packed*ThreeBlocks.java
TestICUFoldingFilterFactory still fails.
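On the jj_expentries difference above: a cast like (int[]) is only needed when the list is declared without a type parameter. How the generated parser actually declares jj_expentries depends on the JavaCC version and options, which aren't stated here, so both declarations below are assumptions for illustration, as is the class name ExpEntriesCast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the generated parser code.
class ExpEntriesCast {

  /** Raw list: get() returns Object, so the (int[]) cast is required. */
  @SuppressWarnings({"rawtypes", "unchecked"})
  static int[] fromRawList(int[] entry) {
    List jj_expentries = new ArrayList();
    jj_expentries.add(entry);
    return (int[]) jj_expentries.get(0);
  }

  /** Parameterized list: get() already returns int[], so no cast is needed. */
  static int[] fromTypedList(int[] entry) {
    List<int[]> jj_expentries = new ArrayList<>();
    jj_expentries.add(entry);
    return jj_expentries.get(0);
  }
}
```

Both forms behave the same at runtime, so if the declaration is parameterized the cast is harmless but redundant, which would make this difference cosmetic rather than functional.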
> "ant regenerate" fails on master
> --------------------------------
>
> Key: LUCENE-9080
> URL: https://issues.apache.org/jira/browse/LUCENE-9080
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: after_regen.patch, before_regen.patch, status.res
>
>
> The root cause is that RamUsageEstimator.NUM_BYTES_INT has been removed and
> the python scripts still reference it in the generated scripts. That part's
> easy to fix.
> Last time I looked, though, the regenerate produces some differences in the
> generated files that should be looked at to ensure they're benign.
> Not really sure whether this should be a Lucene or Solr JIRA. Putting it in
> Lucene since one of the failed files is:
> lucene/core/src/java/org/apache/lucene/util/packed/Packed8ThreeBlocks.java
> I do know that one of the Solr jflex-produced files has an unexplained
> difference so it may bleed over.
> "ant regenerate" needs about 24G on my machine FWIW.