[jira] [Comment Edited] (LUCENE-9080) "ant regenerate" fails on master

Erick Erickson (Jira) Thu, 05 Dec 2019 10:51:09 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989060#comment-16989060
 ]


Erick Erickson edited comment on LUCENE-9080 at 12/5/19 6:50 PM:
-----------------------------------------------------------------

[~rcmuir] Thanks, that put me on a path to figure some things out. I'm still 
baffled, just in a different way.

tl;dr:

- It looks like there are a bunch of hand-edits that are unimportant. They 
should be fixed at the source though if possible.

- There are a couple of hand-edits that should be fixed in the input source 
rather than the output.
-- see LUCENE-8683 and Nikolay Khitrin's comments/work for specific instances 
of hand-edits to java files that should be moved. [~dsmiley] [~sarowe] there 
are a couple of JIRAs mentioned in LUCENE-8683; I may be asking you to glance 
at the ones you worked on and see if you recall anything about those changes.

- We should upgrade javacc to 6.0; I think we're getting deprecated methods 
generated.

LONG FORM:

I tried branch_8x with Java 8, spoofing the bits that download nfc.txt, 
nfkc.txt and nfkc_cf.txt so they use what's already checked out. If I ignore 
all the obvious hand-edits, checksum differences and bogus imports, here's 
what's still weird:

- HTMLCharEntites.jflex acquires an added pair of parens near the end: 
(' | "zwj" | "zwnj"', ')')

Several binary files show differences, but I don't know whether that's just my 
IDE not being able to deal with the charsets:
- org/apache/lucene/analysis/ja/dict/TokenInfoDictionary$fst.dat
- org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$fst.dat
- org/apache/lucene/analysis/icu/utr30.nrm is different
- 9 test binary files en-test-*.bin
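
To take the IDE out of the picture, the files can be compared as raw bytes 
rather than as text. A minimal sketch, assuming two local copies of the same 
file (one from before the regenerate, one from after). The names sha256_of and 
first_mismatch are made up for illustration and aren't part of the Lucene build:

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in binary mode."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large .dat files aren't loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_mismatch(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    for offset, (byte_a, byte_b) in enumerate(zip(data_a, data_b)):
        if byte_a != byte_b:
            return offset
    # Same common prefix: the files differ only if one is longer.
    if len(data_a) != len(data_b):
        return min(len(data_a), len(data_b))
    return None
```

If first_mismatch returns None for a pair, the difference the IDE shows is a 
display issue and not a real change in the file.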

TestICUFoldingFilterFactory still fails; here's one reproduction line:
- ant test  -Dtestcase=TestICUFoldingFilterFactory 
-Dtests.method=testBogusArguments -Dtests.seed=311B6E926642DA19 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kn-IN 
-Dtests.timezone=Europe/Tirane -Dtests.asserts=true -Dtests.file.encoding=UTF-8 

with this error:
Caused by: java.io.IOException: ICU data file error: Header authentication 
failed, please check if you have a valid ICU data file; data format 4e726d32, 
format version 4.0.0.0
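
The "data format" value in that message is four bytes printed as hex, and they 
decode to a readable ASCII tag. A small sketch of the decoding 
(decode_data_format is a hypothetical helper name, not an ICU API):

```python
def decode_data_format(hex_tag):
    """Decode an 8-digit hex tag into its four ASCII characters."""
    raw = bytes.fromhex(hex_tag)
    return raw.decode("ascii")


print(decode_data_format("4e726d32"))  # prints Nrm2
```

So the file is tagged Nrm2 at format version 4.0.0.0. One thing that may be 
worth checking is whether the regenerated utr30.nrm is at a format version the 
ICU4J jar on the classpath accepts. That is a guess and hasn't been verified 
here.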

I see jflex was upgraded, but I don't think regenerate was run after that.

This is virtually identical to what I get when trying this on master and 
pulling down new nfkc*.txt files, although I can't guarantee the binary files 
are identical. There are a few other differences like:
{code}
exptokseq[i] = jj_expentries.get(i);         // old
// vs.
exptokseq[i] = (int[])jj_expentries.get(i);  // new, hand edit I think?
{code}

Also, these files aren't present in master at all; they're "Untracked" 
according to Git:
lucene/core/src/java/org/apache/lucene/util/packed/Direct*.java
lucene/core/src/java/org/apache/lucene/util/packed/Packed*ThreeBlocks.java

TestICUFoldingFilterFactory still fails here too.





> "ant regenerate" fails on master
> --------------------------------
>
>                 Key: LUCENE-9080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9080
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: after_regen.patch, before_regen.patch, status.res
>
>
> The root cause is that RamUsageEstimator.NUM_BYTES_INT has been removed and 
> the python scripts still reference it in the generated scripts. That part's 
> easy to fix.
> Last time I looked, though, the regenerate produces some differences in the 
> generated files that should be looked at to ensure they're benign.
> Not really sure whether this should be a Lucene or Solr JIRA. Putting it in 
> Lucene since one of the failed files is: 
> lucene/core/src/java/org/apache/lucene/util/packed/Packed8ThreeBlocks.java
> I do know that one of the Solr jflex-produced files has an unexplained 
> difference so it may bleed over.
> "ant regenerate" needs about 24G on my machine FWIW.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)