[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886513#comment-16886513 ]
Hoss Man commented on LUCENE-8920:
----------------------------------

[~sokolov] - your revert on branch_8_2 seems to have broken most of the lucene/analysis/kuromoji tests with a common root cause...

{noformat}
   [junit4] ERROR   0.44s J0 | TestFactories.test <<<
   [junit4]    > Throwable #1: java.lang.ExceptionInInitializerError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0)
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
   [junit4]    > 	at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:215)
   [junit4]    > 	at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150)
   [junit4]    > 	at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82)
   [junit4]    > 	at org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174)
   [junit4]    > 	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
   [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427)
   [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
   [junit4]    > 	at org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81)
   [junit4]    > 	at org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60)
   [junit4]    > 	at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: Cannot load TokenInfoDictionary.
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:71)
   [junit4]    > 	... 46 more
   [junit4]    > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 6 and 6)
   [junit4]    > 	at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216)
   [junit4]    > 	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198)
   [junit4]    > 	at org.apache.lucene.util.fst.FST.<init>(FST.java:275)
   [junit4]    > 	at org.apache.lucene.util.fst.FST.<init>(FST.java:263)
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:47)
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:54)
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:32)
   [junit4]    > 	at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:69)
   [junit4]    > 	... 46 more
{noformat}

...perhaps due to "conflicting reverts" w/ LUCENE-8907 / LUCENE-8778 ?

/cc [~tomoko]

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization.
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes)
> which make gaps very costly.
> Associating each label with a dense id and
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset
> instead of doing label->arc directly could save a lot of space in some cases?
> Also it seems that we are repeating the label in the arc metadata when
> array-with-gaps is used, even though it shouldn't be necessary since the
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
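The label -> id -> arc offset suggestion quoted in the description can be sketched as a two-level lookup: map each (sparse) label to a dense id, then index a compact per-id offset table. This is only an illustration of the idea under stated assumptions; all class, field, and method names are hypothetical, not Lucene's actual FST code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-level lookup idea: instead of addressing
// arcs directly by label (which leaves costly gaps for absent labels when
// labels are sparse), map each label to a dense id and keep offsets in a
// compact array indexed by that id.
class DenseArcLookup {
  private final Map<Integer, Integer> labelToId = new HashMap<>();
  private final long[] idToArcOffset;

  // labels[i] is the label whose arc metadata lives at arcOffsets[i];
  // dense ids 0..n-1 are assigned in array order.
  DenseArcLookup(int[] labels, long[] arcOffsets) {
    for (int i = 0; i < labels.length; i++) {
      labelToId.put(labels[i], i);
    }
    this.idToArcOffset = arcOffsets;
  }

  // Two hops: label -> dense id -> arc offset; -1 means "no arc for label".
  long arcOffset(int label) {
    Integer id = labelToId.get(label);
    return id == null ? -1L : idToArcOffset[id];
  }
}
```

The point of the indirection is that the offset table has one entry per existing arc, so sparse label spaces no longer force gap entries in the addressable region.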
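The IndexFormatTooNewException in the trace above ("7 (needs to be between 6 and 6)") is the signature of a min/max range check on the format version read from a file header: the branch_8_2 reader, after the revert, only accepts version 6 but found data written as version 7. A minimal sketch of that kind of check follows; the names are illustrative, not Lucene's actual CodecUtil code.

```java
// Hypothetical sketch of a header version-range check like the one that
// fails in the stack trace; constants and names are illustrative only.
class VersionCheck {
  static final int MIN_SUPPORTED = 6; // oldest format version this reader accepts
  static final int MAX_SUPPORTED = 6; // newest format version this reader accepts

  // Returns the version unchanged if it is in the supported range,
  // otherwise throws with a message naming the accepted range.
  static int checkVersion(int actual) {
    if (actual < MIN_SUPPORTED || actual > MAX_SUPPORTED) {
      throw new RuntimeException("Format version is not supported: " + actual
          + " (needs to be between " + MIN_SUPPORTED + " and " + MAX_SUPPORTED + ")");
    }
    return actual;
  }
}
```

With MIN_SUPPORTED == MAX_SUPPORTED == 6, a version-7 resource fails exactly as in the kuromoji test output, which is consistent with a dictionary built by newer code being read by reverted code.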