[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.
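
For illustration, a rough sketch of the difference I mean (simplified, hypothetical
code, not the actual branch sources):

{noformat}
// seekExact(BytesRef) really seeks: it walks the term dictionary (FST) and
// loads the term's metadata into this enum.
@Override
public boolean seekExact(BytesRef target) throws IOException {
  return doSeek(target) == SeekStatus.FOUND;   // doSeek is a placeholder name
}

// seekExact(BytesRef, TermState) doesn't seek at all: it just adopts the
// caller-supplied state, trusting it was obtained from termState() earlier.
@Override
public void seekExact(BytesRef target, TermState state) throws IOException {
  this.term = BytesRef.deepCopyOf(target);
  this.state.copyFrom(state);
}
{noformat}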

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
nodes can be 'merged'.
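
Just to illustrate the two points above, a rough sketch of what mixing
docFreq/totalTermFreq into hashCode(), plus a matching equals(), might look like;
the field names are assumptions from this discussion, not the actual patch:

{noformat}
// uses java.util.Arrays; longs/bytes/docFreq/totalTermFreq are assumed fields
@Override
public int hashCode() {
  int h = Arrays.hashCode(longs);
  h = 31 * h + (bytes == null ? 0 : bytes.hashCode());
  h = 31 * h + docFreq;                                        // previously not mixed in
  h = 31 * h + (int) (totalTermFreq ^ (totalTermFreq >>> 32)); // previously not mixed in
  return h;
}

@Override
public boolean equals(Object other) {
  if (!(other instanceof TempMetaData)) {
    return false;
  }
  TempMetaData o = (TempMetaData) other;
  return docFreq == o.docFreq
      && totalTermFreq == o.totalTermFreq
      && Arrays.equals(longs, o.longs)
      && (bytes == null ? o.bytes == null : bytes.equals(o.bytes));
}
{noformat}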

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
fst nodes can be 'merged'.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
nodes can be 'merged'.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:35 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
fst nodes can be 'merged'.-
Oops, I forgot it still relies on equals to make sure two instances really
match; ok, I'll add that.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
fst nodes can be 'merged'.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 4:09 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
fst nodes can be 'merged'.-
Oops, I forgot it still relies on equals to make sure two instances really
match; ok, I'll add that.

By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they
contain the same metadata + stats, it is very seldom that their arcs can be
identical in the FST (the index grows by less than 1MB for wikimedium1m if
equals always returns false for non-singleton arguments). Therefore... yes,
hashCode() isn't necessary here.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on
the input term string, and fetch the related metadata from the term dict.

However, seekExact(BytesRef, TermState) simply copies the value of termState into the
enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh yes, I once thought about this, but I'm not sure: can the callee always make sure
that, when 'term()' is called, it will return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null;
is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two
fst nodes can be 'merged'.-
Oops, I forgot it still relies on equals to make sure two instances really
match; ok, I'll add that.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708638#comment-13708638
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 5:08 PM:


Patch according to previous comments.

We still somewhat need hashCode(), because NodeHash will check whether the
frozen node has the same hash code as the uncompiled node (NodeHash.java:128).

Although later, for nodes with outputs, it will hardly ever find an identical
node in the hash table.

  was (Author: billy):
Patch according to previous comments.

We still somewhat need hashCode(), because NodeHash will check whether the
frozen node has the same hash code as the uncompiled node (NodeHash:128).
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:04 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of
148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}
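
To make the estimation easier to read, here is the same computation as a small
code sketch over the table columns above (vIntByteSize(bits) = ceil(bits/7) is
an assumption for the size of a VInt holding a value of that bit width):

{noformat}
static int vIntByteSize(int bitWidth) {
  return (bitWidth + 6) / 7;              // a VInt carries 7 payload bits per byte
}

// dfEqTtf, df, ttf are the three table columns above, indexed by bit width
static void estimate(long[] dfEqTtf, long[] df, long[] ttf) {
  long oldSize = 0, newSize = 0;
  for (int bit = 1; bit < df.length; bit++) {
    oldSize += df[bit] * vIntByteSize(bit) + ttf[bit] * vIntByteSize(bit);
    // new encoding: df steals one bit (so may need one extra VInt byte),
    // and ttf is omitted whenever df == ttf
    newSize += df[bit] * vIntByteSize(bit + 1)
             + (ttf[bit] - dfEqTtf[bit]) * vIntByteSize(bit);
  }
  System.out.println("old=" + oldSize + " bytes, new=" + newSize + " bytes");
}
{noformat}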

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small).
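
(As a sketch of the idea, not Lucene41PostingsReader's actual code: once the
term dict tells us df==ttf, a reader could simply report freq()==1 and never
touch the frq data for that term.)

{noformat}
// Hypothetical reader-side use of a df==ttf flag:
@Override
public int freq() throws IOException {
  if (dfEqualsTtf) {
    return 1;                 // every in-doc freq is 1, nothing was written
  }
  return decodeNextFreq();    // placeholder for the usual VInt/FOR decoding
}
{noformat}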

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to
about 95.8MB, I suppose.

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of
148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to
about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary 

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:05 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of
148.7MB in total, using the following estimation:


{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to
about 95.8MB, I suppose.

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of
148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to
about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST 

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:00 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Using the following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}
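
A sketch of how the two tricks could look on the writer side (the method names
are assumptions, and storing ttf as a delta from df is also an assumption):

{noformat}
// "zero df" trick: df==0 is otherwise impossible, so it can mark df==ttf==1
void writeTermStats(DataOutput out, int docFreq, long totalTermFreq) throws IOException {
  if (docFreq == 1 && totalTermFreq == 1) {
    out.writeVInt(0);                          // one sentinel byte, ttf omitted
  } else {
    out.writeVInt(docFreq);
    out.writeVLong(totalTermFreq - docFreq);   // ttf >= df, so store the delta
  }
}

// "steal one bit" trick: low bit of df says whether ttf follows
void writeTermStatsStealBit(DataOutput out, int docFreq, long totalTermFreq) throws IOException {
  if (docFreq == totalTermFreq) {
    out.writeVInt(docFreq << 1 | 1);           // flag set: ttf == df, not written
  } else {
    out.writeVInt(docFreq << 1);
    out.writeVLong(totalTermFreq - docFreq);
  }
}
{noformat}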


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, -I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small)-
(ah, I forgot we already steal a bit for this case in Lucene41PBF).

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of
148.7MB in total, using the following estimation:


{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to
about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST 

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:02 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Using the following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, -I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small)-
(ah, I forgot we already steal a bit for this case in Lucene41PBF).

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Using the following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we know the in-doc freq is always 1. So for example,
when the bit width ranges from 2 to 8 (inclusive), since df is not large enough
to create ForBlocks, we have to VInt-encode each in-doc freq. For this 'body'
field, -I think the index size we can reduce is about 67.5MB (here I only
consider the vInt blocks, since the 1-bit ForBlock case is usually small)-
(ah, I forgot we already steal a bit for this case in Lucene41PBF).

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in 

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707780#comment-13707780
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 4:48 PM:


Uploaded detail data for wikimediumall.

Oh, sorry, there was an error when I calculated the index size for the df==0
trick; it should be 105MB instead of 70MB.

But the real test still doesn't quite match the estimation (weird...). The
df==0 trick gains similar compression.

Index sizes are below (KB):
{noformat}
v0:   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:12780884
{noformat}

Another thing that surprised me: with the same code/conf, luceneutil creates
indexes of different sizes? I tested the df==0 trick several times on
wikimedium1m, and the index size varies from 514M to 522M... Does
multi-threading affect things much here?


  was (Author: billy):
Uploaded detail data for wikimediumall.

Oh, sorry, there was an error when I calculated the index size for the df==0
trick; it should be 105MB instead of 70MB.

But the real test still doesn't quite match the estimation (weird...). The
df==0 trick gains similar compression.

Index sizes are below:
{noformat}
v0:   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:12780884
{noformat}

Another thing that surprised me: with the same code/conf, luceneutil creates
indexes of different sizes? I tested the df==0 trick several times on
wikimedium1m, and the index size varies from 514M to 522M... Does
multi-threading affect things much here?

  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-12 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13706703#comment-13706703
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 1:42 AM:


Uploaded patch; it is the main part of the changes I committed to branch3069.

The picture shows the current impl of the outputs (fetched from one field in
wikimedium5k).

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single flag byte is used to indicate whether/which fields the current output
maintains; for a PBF with a short byte[], this should be enough. Also, for
long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the
body field in wikimedium1m, 85.8% of terms have df == ttf).
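
To make the layout concrete, here is a hedged sketch of how such an output
could be serialized; the flag bits and names below are assumptions from this
description, not the actual branch code:

{noformat}
static final int HAS_LONGS = 1, HAS_BYTES = 2, HAS_STATS = 4, DF_EQ_TTF = 8;

void writeOutput(DataOutput out, long[] longs, BytesRef bytes, int df, long ttf)
    throws IOException {
  int flag = 0;
  if (longs != null && longs.length > 0) flag |= HAS_LONGS;
  if (bytes != null && bytes.length > 0) flag |= HAS_BYTES;
  if (df > 0)                            flag |= HAS_STATS;
  if (df == ttf)                         flag |= DF_EQ_TTF;   // ttf inlined into df
  out.writeByte((byte) flag);
  if ((flag & HAS_LONGS) != 0) {
    for (long v : longs) out.writeVLong(v);
  }
  if ((flag & HAS_BYTES) != 0) {
    out.writeVInt(bytes.length);
    out.writeBytes(bytes.bytes, bytes.offset, bytes.length);
  }
  if ((flag & HAS_STATS) != 0) {
    out.writeVInt(df);
    if ((flag & DF_EQ_TTF) == 0) out.writeVLong(ttf - df);
  }
}
{noformat}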


Since the TermsEnum is entirely based on FSTEnum, the performance of the term
dict should be similar to MemoryPF. However, for PK tasks, we have to pull the
docsEnum from MMap, so this hurts.


Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

                Task    QPS base    StdDev    QPS comp    StdDev        Pct diff
             Respell       48.13    (4.4%)       15.38    (1.0%)  -68.0% ( -70% -  -65%)
              Fuzzy2       51.30    (5.3%)       17.47    (1.3%)  -65.9% ( -68% -  -62%)
              Fuzzy1       52.24    (4.0%)       18.50    (1.2%)  -64.6% ( -67% -  -61%)
            Wildcard        9.31    (1.7%)        6.16    (2.2%)  -33.8% ( -37% -  -30%)
             Prefix3       23.25    (1.8%)       19.00    (2.2%)  -18.3% ( -21% -  -14%)
            PKLookup      244.92    (3.6%)      225.42    (2.3%)   -8.0% ( -13% -   -2%)
             LowTerm      295.88    (5.5%)      293.27    (4.8%)   -0.9% ( -10% -    9%)
          HighPhrase       13.62    (6.5%)       13.54    (7.4%)   -0.6% ( -13% -   14%)
             MedTerm       99.51    (7.8%)       99.19    (7.7%)   -0.3% ( -14% -   16%)
           MedPhrase      154.63    (9.4%)      154.38   (10.1%)   -0.2% ( -17% -   21%)
            HighTerm       28.25   (10.7%)       28.25   (10.0%)   -0.0% ( -18% -   23%)
          OrHighHigh       16.83   (13.3%)       16.86   (13.1%)    0.2% ( -23% -   30%)
    HighSloppyPhrase        9.02    (4.4%)        9.03    (4.5%)    0.2% (  -8% -    9%)
           LowPhrase        6.26    (3.4%)        6.27    (4.1%)    0.2% (  -7% -    8%)
           OrHighMed       13.73   (13.2%)       13.77   (12.8%)    0.3% ( -22% -   30%)
           OrHighLow       25.65   (13.2%)       25.73   (13.0%)    0.3% ( -22% -   30%)
     MedSloppyPhrase        6.63    (2.7%)        6.66    (2.7%)    0.5% (  -4% -    6%)
          AndHighMed       42.77    (1.8%)       43.13    (1.5%)    0.8% (  -2% -    4%)
     LowSloppyPhrase       32.68    (3.0%)       32.96    (2.8%)    0.8% (  -4% -    6%)
         AndHighHigh       22.90    (1.2%)       23.18    (0.7%)    1.2% (   0% -    3%)
         LowSpanNear       29.30    (2.0%)       29.83    (2.2%)    1.8% (  -2% -    6%)
         MedSpanNear        8.39    (2.7%)        8.56    (2.9%)    2.0% (  -3% -    7%)
              IntNRQ        3.12    (1.9%)        3.18    (6.7%)    2.1% (  -6% -   10%)
          AndHighLow      507.01    (2.4%)      522.10    (2.8%)    3.0% (  -2% -    8%)
        HighSpanNear        5.43    (1.8%)        5.60    (2.6%)    3.1% (  -1% -    7%)
{noformat}


{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

                Task    QPS base    StdDev    QPS comp    StdDev        Pct diff
             Respell       49.24    (2.7%)       15.51    (1.0%)  -68.5% ( -70% -  -66%)
              Fuzzy2       52.01    (4.8%)       17.61    (1.4%)  -66.1% ( -68% -  -63%)
              Fuzzy1       53.00    (4.0%)       18.62    (1.3%)  -64.9% ( -67% -  -62%)
            Wildcard        9.37    (1.3%)        6.15    (2.1%)  -34.4% ( -37% -  -31%)
             Prefix3       23.36    (0.8%)       18.96    (2.1%)  -18.8% ( -21% -  -16%)
           MedPhrase      155.86    (9.8%)      152.34    (9.7%)   -2.3% ( -19% -   19%)
           LowPhrase        6.33    (3.7%)        6.23    (4.0%)   -1.6% (  -8% -    6%)
          HighPhrase       13.68    (7.2%)       13.49    (6.8%)   -1.4% ( -14% -   13%)
           OrHighMed       13.78   (13.0%)       13.68   (12.7%)   -0.8% ( -23% -   28%)
    HighSloppyPhrase        9.14    (5.2%)        9.07    (3.7%)   -0.7% (  -9% -    8%)
          OrHighHigh       16.87   (13.3%)       16.76   (12.9%)   -0.6% ( -23% -   29%)
   OrHighLow   25.71 (13.1%)   

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-04-26 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642971#comment-13642971
 ] 

Han Jiang edited comment on LUCENE-3069 at 4/26/13 6:08 PM:


This is my initial proposal for this project: 
https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/34001
I'm looking forward to your feedback. :)

  was (Author: billy):
This is my initial proposal for this project: 
https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/32001
I'm looking forward to your feedback. :)
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2013
 Fix For: 4.3


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org