[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061195#comment-13061195 ]
Michael McCandless commented on LUCENE-3289: -------------------------------------------- NOTE: patch applies to 3.x. I ran the patch on the titles, varying the max prefix sharing length: ||Len||FST Size||Seconds|| |1|135446807|8.2| |2|137632702|8.5| |3|135177994|8.3| |4|132782016|8.3| |5|130415331|8.4| |6|128086200|8.0| |7|125797396|8.2| |8|123552157|8.5| |9|121358375|8.4| |10|119228942|8.1| |11|117181180|8.8| |12|115229788|8.7| |13|113388260|9.5| |14|111664442|9.0| |15|110059167|9.2| |16|108572519|9.7| |17|107201905|9.8| |18|105942576|10.3| |19|104791497|10.1| |20|103745678|11.1| |21|102801693|10.8| |22|101957797|11.4| |23|101206564|11.1| |24|100541849|11.0| |25|99956443|11.1| |26|99443232|12.9| |27|98995194|13.2| |28|98604680|13.9| |29|98264184|13.5| |30|97969241|13.6| |31|97714049|13.8| |32|97494104|14.3| |33|97304045|14.0| |34|97140033|14.3| |35|96998942|14.6| |36|96877590|16.5| |37|96773039|16.9| |38|96682961|16.6| |39|96605160|17.8| |40|96537687|18.3| |41|96479286|17.8| |42|96428710|17.5| |43|96384659|18.9| |44|96346174|17.0| |45|96312826|19.3| |46|96283545|17.8| |47|96257708|19.4| |48|96235159|19.0| |49|96215220|18.7| |50|96197450|19.6| |51|96181539|17.3| |52|96167235|16.9| |53|96154490|17.7| |54|96143081|18.8| |55|96132905|17.4| |56|96123776|17.5| |57|96115462|20.7| |58|96108051|19.2| |59|96101249|19.1| |60|96095107|18.7| |ALL|96020343|22.5| Very very odd that FST size first goes up at N=2... not yet sure why. But from this curve it looks like there is a sweet spot around maybe N=24. I didn't measure required heap here, but it also will go down as N goes down. > FST should allow controlling how hard builder tries to share suffixes > --------------------------------------------------------------------- > > Key: LUCENE-3289 > URL: https://issues.apache.org/jira/browse/LUCENE-3289 > Project: Lucene - Java > Issue Type: Improvement > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 3.4, 4.0 > > Attachments: LUCENE-3289.patch > > > Today we have a boolean option to the FST builder telling it whether > it should share suffixes. > If you turn this off, building is much faster, uses much less RAM, and > the resulting FST is a prefix trie. But, the FST is larger than it > needs to be. When it's on, the builder maintains a node hash holding > every node seen so far in the FST -- this uses up RAM and slows things > down. > On a dataset that Elmer (see java-user thread "Autocompletion on large > index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles > avg 67.3 chars per title, building with suffix sharing on took 22.5 > seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix > sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. > I think we should allow this boolean to be shade-of-gray instead: > usually, how well suffixes can share is a function of how far they are > from the end of the string, so, by adding a tunable N to only share > when suffix length < N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org