[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061804#comment-13061804 ]
Eks Dev commented on LUCENE-3289:
---------------------------------

bq. The strings are extremely long (more like short documents) and probably need to be "compressed" in some different datastructure, e.g. a word-based one?

That would indeed be cool, e.g. an FST with words (n-grams?) as symbols. Ages ago we used one trie over all unique terms to get prefix/edit-distance lookups on words, and one word-trie (symbols were words, via a symbol table) for "documents". I am sure this would cut memory requirements significantly for multiword cases when compared to a char-level FST. E.g. a TermDictionary that supports ord() could be used as a symbol table.

> FST should allow controlling how hard builder tries to share suffixes
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3289
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3289.patch, LUCENE-3289.patch
>
>
> Today we have a boolean option to the FST builder telling it whether
> it should share suffixes.
> If you turn this off, building is much faster, uses much less RAM, and
> the resulting FST is a prefix trie. But, the FST is larger than it
> needs to be. When it's on, the builder maintains a node hash holding
> every node seen so far in the FST -- this uses up RAM and slows things
> down.
> On a dataset that Elmer (see java-user thread "Autocompletion on large
> index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
> avg 67.3 chars per title, building with suffix sharing on took 22.5
> seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix
> sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
> I think we should allow this boolean to be shade-of-gray instead:
> usually, how well suffixes can share is a function of how far they are
> from the end of the string, so, by adding a tunable N to only share
> when suffix length < N, we can let caller make reasonable tradeoffs.
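
The "tunable N" idea from the issue description can be illustrated with a toy character-level trie in Python. This is only a sketch under stated assumptions, not Lucene's FST builder: `Node`, `build_trie`, `share_suffixes`, `count_nodes`, `node_count` and the `max_tail` parameter are hypothetical names, and "suffix length" is assumed here to mean the longest path from a node to a leaf. Unlike a real builder, the sketch first builds the whole trie and then merges nodes, so it shows only the size side of the tradeoff, not the build-time RAM savings.

```python
class Node:
    """One state: outgoing labelled edges plus an 'accepting' flag."""
    __slots__ = ("final", "edges")

    def __init__(self):
        self.final = False
        self.edges = {}  # label -> Node


def build_trie(words):
    """Plain prefix trie: no suffix sharing at all."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def share_suffixes(root, max_tail):
    """Merge identical sub-tries, but only for nodes whose longest
    remaining suffix is shorter than max_tail (hypothetical knob N).
    max_tail=0 leaves the prefix trie untouched."""
    registry = {}  # node signature -> canonical node (the "node hash")

    def visit(node):
        height = 0
        for label, child in list(node.edges.items()):
            canonical, child_height = visit(child)
            node.edges[label] = canonical
            height = max(height, child_height + 1)
        if height < max_tail:
            # Children are already canonical, so their ids identify them.
            signature = (
                node.final,
                tuple(sorted((l, id(c)) for l, c in node.edges.items())),
            )
            node = registry.setdefault(signature, node)
        return node, height

    return visit(root)[0]


def count_nodes(root):
    """Count distinct reachable nodes."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


def accepts(root, word):
    """Check that a word is still accepted after sharing."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def node_count(words, max_tail):
    return count_nodes(share_suffixes(build_trie(words), max_tail))


WORDS = ["walking", "talking", "jumping"]

if __name__ == "__main__":
    for n in (0, 4, 100):
        print(n, node_count(WORDS, n))
```

For these three words the node count goes from 22 (no sharing) to 14 (sharing only tails shorter than 4) to 11 (unrestricted sharing), while the registry only ever holds nodes near the end of the strings when `max_tail` is small.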