[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886513#comment-16886513 ]

Hoss Man commented on LUCENE-8920:
----------------------------------

[~sokolov] - your revert on branch_8_2 seems to have broken most of the 
lucene/analysis/kuromoji tests with a common root cause...

{noformat}
  [junit4] ERROR   0.44s J0 | TestFactories.test <<<
   [junit4]    > Throwable #1: java.lang.ExceptionInInitializerError
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0)
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
   [junit4]    >        at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:215)
   [junit4]    >        at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150)
   [junit4]    >        at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82)
   [junit4]    >        at org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174)
   [junit4]    >        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
   [junit4]    >        at org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81)
   [junit4]    >        at org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: Cannot load TokenInfoDictionary.
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:71)
   [junit4]    >        ... 46 more
   [junit4]    > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 6 and 6)
   [junit4]    >        at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216)
   [junit4]    >        at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198)
   [junit4]    >        at org.apache.lucene.util.fst.FST.<init>(FST.java:275)
   [junit4]    >        at org.apache.lucene.util.fst.FST.<init>(FST.java:263)
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:47)
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:54)
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:32)
   [junit4]    >        at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:69)
   [junit4]    >        ... 46 more
{noformat}

...perhaps due to "conflicting reverts" w/ LUCENE-8907 / LUCENE-8778?
/cc [~tomoko]
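For reference, the root cause reads as a simple version-bounds mismatch: the pre-built kuromoji dictionary resource was written with FST format version 7, while the reverted reader on branch_8_2 accepts only version 6. A minimal sketch of that kind of bounds check (a simplified illustration, not Lucene's actual CodecUtil code; names and messages are made up to mirror the exception above):

```java
// Hypothetical sketch of a header version check like the one that throws
// IndexFormatTooNewException above. Not Lucene's implementation.
class VersionCheck {
    static final int MIN_SUPPORTED = 6; // oldest readable format version
    static final int MAX_SUPPORTED = 6; // newest readable after the revert

    // Returns the version if it is within bounds, otherwise throws,
    // producing a message shaped like the one in the stack trace.
    static int checkVersion(int actual) {
        if (actual < MIN_SUPPORTED || actual > MAX_SUPPORTED) {
            throw new IllegalStateException(
                "Format version is not supported: " + actual
                + " (needs to be between " + MIN_SUPPORTED
                + " and " + MAX_SUPPORTED + ")");
        }
        return actual;
    }

    public static void main(String[] args) {
        System.out.println(checkVersion(6)); // within bounds
        try {
            checkVersion(7); // the dictionary's version: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```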

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly, could save a lot of space in some 
> cases? Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
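The intermediate-lookup idea quoted above could be sketched roughly as follows (a hypothetical illustration, not the actual FST code: the class, field names, and layout are all invented for this sketch). Direct addressing reserves a full arc-metadata slot for every label in [minLabel, maxLabel], so each gap wastes 10-20 bytes; a dense-id indirection pays only one small id entry per possible label, with arc metadata stored densely:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "label -> id -> arc offset" indirection.
// Not Lucene's FST implementation; names and layout are invented.
class LabelIndirection {
    // First level: label -> dense id. In a real encoding this could be a
    // compact byte[] indexed over [minLabel, maxLabel] instead of a map.
    final Map<Integer, Integer> labelToId = new HashMap<>();
    // Second level: dense id -> arc offset, with no gaps.
    final long[] idToArcOffset;

    LabelIndirection(int[] labels, long[] arcOffsets) {
        idToArcOffset = arcOffsets;
        for (int i = 0; i < labels.length; i++) {
            labelToId.put(labels[i], i); // dense ids 0..n-1 in label order
        }
    }

    // Returns the arc offset for a label, or -1 if there is no arc
    // for that label (what direct addressing would store as a gap).
    long arcOffset(int label) {
        Integer id = labelToId.get(label);
        return id == null ? -1 : idToArcOffset[id];
    }

    public static void main(String[] args) {
        // Arcs for labels 'a', 'i', 'u' only; 'b'..'h' etc. are gaps.
        LabelIndirection li =
            new LabelIndirection(new int[] {'a', 'i', 'u'}, new long[] {0, 24, 48});
        System.out.println(li.arcOffset('i')); // present: offset of the 'i' arc
        System.out.println(li.arcOffset('b')); // absent: -1, costing no metadata slot
    }
}
```

The point of the sketch is only the space trade-off: the gaps between 'a', 'i', and 'u' cost nothing in the second-level array, whereas label->arc direct addressing would reserve a full metadata slot for each missing label in between.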



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
