[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647 ]
Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:14 PM:
-------------------------------------------------------------------

For posterity, this is the worst-case test that spreads out terms:

{{
for (int i = 0; i < 1000000; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // clear the two low bits: make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}
buildFST(entries).ramBytesUsed();
}}

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization.
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte
> range), which makes gaps very costly. Associating each label with a dense id
> and having an intermediate lookup, i.e. lookup label -> id and then
> id -> arc offset instead of doing label -> arc directly, could save a lot of
> space in some cases?
> Also it seems that we are repeating the label in the arc metadata when
> array-with-gaps is used, even though it shouldn't be necessary since the
> label is implicit from the address?
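
The "label -> id and then id -> arc offset" idea from the description can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's FST code: the class name DenseLabelLookup, the 256-bit presence bitmap, and the fixed 10-byte arc size (the low end of the 10-20 bytes quoted in the description) are all assumptions made for the example. It uses the label pattern produced by the worst-case test (every label a multiple of 4) to compare a direct-addressed array with a bitmap plus a gap-free arc array.

```java
/**
 * Hypothetical sketch, not Lucene code: label -> dense id -> arc offset
 * for a single node whose arc labels are unsigned bytes (0..255).
 */
final class DenseLabelLookup {
    // Assumed layout: one presence bit per possible label, 256 bits in total.
    private final long[] present = new long[4];
    // One offset per arc that actually exists, in label order (no gaps).
    private final int[] arcOffsets;

    DenseLabelLookup(int[] sortedLabels, int[] offsets) {
        for (int label : sortedLabels) {
            present[label >>> 6] |= 1L << (label & 63);
        }
        this.arcOffsets = offsets.clone();
    }

    /** Returns the arc offset for the label, or -1 if the node has no such arc. */
    int arcOffset(int label) {
        int word = label >>> 6;
        long bit = 1L << (label & 63);
        if ((present[word] & bit) == 0) {
            return -1;
        }
        // Dense id = number of present labels smaller than this one.
        int id = Long.bitCount(present[word] & (bit - 1));
        for (int i = 0; i < word; i++) {
            id += Long.bitCount(present[i]);
        }
        return arcOffsets[id];
    }

    /** Slots a direct-addressed array needs: one per label in [min, max]. */
    static int directAddressingSlots(int[] sortedLabels) {
        return sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    }

    public static void main(String[] args) {
        int arcBytes = 10; // assumed fixed arc size, for illustration only
        int[] labels = new int[64];
        int[] offsets = new int[64];
        for (int i = 0; i < 64; i++) {
            labels[i] = i * 4;          // same spread as b[j] &= 0xfc
            offsets[i] = i * arcBytes;  // arcs packed back to back
        }
        DenseLabelLookup node = new DenseLabelLookup(labels, offsets);
        int slots = directAddressingSlots(labels);
        // 253 slots for 64 arcs: 2530 bytes at 10 bytes per slot
        System.out.println("direct addressing: " + slots + " slots, "
            + slots * arcBytes + " bytes");
        // 32-byte bitmap + 64 packed arcs: 672 bytes
        System.out.println("bitmap + dense:    " + labels.length + " arcs, "
            + (32 + labels.length * arcBytes) + " bytes");
        System.out.println("label 8 -> " + node.arcOffset(8)
            + ", label 9 -> " + node.arcOffset(9));
    }
}
```

With labels spread this way, the direct-addressed array holds 253 slots for 64 real arcs, a ratio of about 4 to 1, which is in line with the ~4x figure in the description. The sketch ignores per-node headers and variable-length arcs, so the byte counts are only indicative.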