[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2948: --- Attachment: Results.png Graph showing perf results. Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948_automaton.patch, Results.png Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2948: --- Attachment: LUCENE-2948.patch New patch, fixes the bug Robert found (thanks!) and a few more adds a new test case. Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948_automaton.patch Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2948: --- Attachment: LUCENE-2948.patch Initial patch. This is a checkpoint of work-in-progress -- all tests pass, but there are zillions of nocommits to be resolved... Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2948: --- Attachment: LUCENE-2948.patch New patch -- changes nextPossiblePrefix to return a SeekStatus. Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch, LUCENE-2948.patch Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2948: Attachment: LUCENE-2948_automaton.patch Nice work Mike, i think I found a bug with nextPossiblePrefix though? I attached my modifications to try to use this with Automaton. (just the automaton parts): I'm also somehow triggering BlockReader's assert about crossing over index terms with other tests... I think i could see the problem here... is it that nextPossiblePrefix(BytesRef prefix) means it wants me to truly pass in a prefix? obviously a consumer doesn't know which portion of his term is/isnt a prefix! So we would have to expose that :(, or alternatively change the semantics to nextPossiblePrefix(BytesRef term)? In other words, in this situation of 1[\u]234567891 it would simply return true, because it knows 1* exists rather than forwarding me to s? Maybe this is what was intended all along and its just an off by one? {noformat} [junit] NOTE: reproduce with: ant test -Dtestcase=TestFuzzyQuery -Dtestmethod=testTokenLengthOpt -Dtests.seed=4471452442745287654:-2341611255635429887 -Dtests.codec=Standard // NOTE: this index has two terms: 12345678911 and segment [junit] - Standard Output --- [junit] candidate: [\u]1234567891 [junit] not found, goto: 1 [junit] candidate: 1[\u]234567891 [junit] not found, goto: s --- this is the problem, because 12345678911 exists [junit] candidate: s1234567891 [junit] found! [junit] candidate: t1234567891 [junit] found! {noformat} Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948_automaton.patch Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org