[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: Results.png

Graph showing perf results.

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch, 
 LUCENE-2948_automaton.patch, Results.png


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: LUCENE-2948.patch

New patch, fixes the bug Robert found (thanks!) and a few more adds a new 
test case.

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch, 
 LUCENE-2948_automaton.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: LUCENE-2948.patch

Initial patch.  This is a checkpoint of work-in-progress -- all tests pass, but 
there are zillions of nocommits to be resolved...

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: LUCENE-2948.patch

New patch -- changes nextPossiblePrefix to return a SeekStatus.

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2948:


Attachment: LUCENE-2948_automaton.patch

Nice work Mike, i think I found a bug with nextPossiblePrefix though?

I attached my modifications to try to use this with Automaton. (just the 
automaton parts):

I'm also somehow triggering BlockReader's assert about crossing over index 
terms with other tests...

I think i could see the problem here... is it that 
nextPossiblePrefix(BytesRef prefix) means it wants me to truly pass in a 
prefix? obviously a consumer doesn't know which portion of his term is/isnt a 
prefix!

So we would have to expose that :(, or alternatively change the semantics to 
nextPossiblePrefix(BytesRef term)? In other words, in this situation of 
1[\u]234567891 it would simply return true, because it knows 1* exists 
rather than forwarding me to s? Maybe this is what was intended all along and 
its just an off by one?

{noformat}
[junit] NOTE: reproduce with: ant test -Dtestcase=TestFuzzyQuery 
-Dtestmethod=testTokenLengthOpt 
-Dtests.seed=4471452442745287654:-2341611255635429887 -Dtests.codec=Standard

// NOTE: this index has two terms: 12345678911 and segment

[junit] - Standard Output ---
[junit] candidate: [\u]1234567891
[junit] not found, goto: 1
[junit] candidate: 1[\u]234567891
[junit] not found, goto: s --- this is the problem, because 12345678911 
exists
[junit] candidate: s1234567891
[junit] found!
[junit] candidate: t1234567891
[junit] found!
{noformat}

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, 
 LUCENE-2948_automaton.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org