[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term string, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two nodes can be 'merged'.

Lucene should have an entirely memory resident term dictionary
--
Key: LUCENE-3069
URL: https://issues.apache.org/jira/browse/LUCENE-3069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
Labels: gsoc2013
Fix For: 4.4
Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch

The FST-based term dictionary has been a great improvement, yet it still uses a delta-codec file for scanning to terms. Some environments have enough memory available to keep the entire FST-based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds the FST from the entire term, not just the delta.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
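The distinction drawn above — repositioning the enum via a dictionary lookup versus merely restoring caller-supplied state — can be sketched with a toy enum. This is an illustrative model only, not Lucene's actual TermsEnum; the class and field names here are made up:

```java
import java.util.Arrays;

// Toy model of the two seekExact flavors discussed above. A "real"
// seek repositions the enum by looking the term up in the dictionary;
// seekExact(term, state) just trusts and copies the caller-supplied
// state without touching the dictionary at all.
class ToyTermsEnum {
    private final String[] sortedTerms; // stands in for the term dict
    private int pos = -1;               // current enum position

    ToyTermsEnum(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    // "Real" seek: a dictionary lookup that moves the pointer.
    boolean seekExact(String term) {
        int i = Arrays.binarySearch(sortedTerms, term);
        if (i < 0) {
            return false;
        }
        pos = i;
        return true;
    }

    // Capture enough state to restore this position later.
    int termState() {
        return pos;
    }

    // State-copy "seek": no dictionary access happens here.
    void seekExact(String term, int state) {
        pos = state;
    }

    String term() {
        return sortedTerms[pos];
    }
}
```

For example, after a successful `seekExact("bar")` a caller can capture `termState()`, move the enum elsewhere, and later restore the old position with the two-argument `seekExact` without paying for another lookup — which is exactly why it never 'operates a seek' on the dictionary.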
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:35 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term string, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two FST nodes can be 'merged'.- Oops, I forgot it still relies on equals to make sure two instances really match; ok, I'll add that.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 4:09 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term string, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two FST nodes can be 'merged'.- Oops, I forgot it still relies on equals to make sure two instances really match; ok, I'll add that.

By the way, for real data, when two outputs are not 'NO_OUTPUT', even when they contain the same metadata + stats, it seems very seldom that their arcs can be identical in the FST (the index grows by less than 1MB for wikimedium1m if equals always returns false for non-singleton arguments). Therefore... yes, hashCode() isn't necessary here.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708638#comment-13708638 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 5:08 PM:

Patch according to previous comments. We still somewhat need the existence of hashCode(), because NodeHash checks whether the frozen node has the same hash code as the uncompiled node (NodeHash.java:128). Although later, for nodes with outputs, it will hardly ever find a matching node in the hash table.
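The contract at play in this thread — a node-sharing hash table first compares hash codes, then confirms a real match with equals — requires the term-metadata output to implement both consistently, and to mix the stats into the hash. A minimal sketch, with hypothetical field names rather than the actual TempMetaData from the patch:

```java
import java.util.Arrays;

// Sketch of an FST output holding per-term stats + metadata bytes.
// hashCode() must mix in docFreq/totalTermFreq (the bug caught in
// review above), and equals() must agree with it, or a NodeHash-style
// dedup table will miss shareable nodes or merge wrong ones.
final class TermMeta {
    final long docFreq;
    final long totalTermFreq;
    final byte[] metadata;

    TermMeta(long docFreq, long totalTermFreq, byte[] metadata) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
        this.metadata = metadata;
    }

    @Override
    public int hashCode() {
        int h = Arrays.hashCode(metadata);
        h = 31 * h + Long.hashCode(docFreq);       // mix in the stats,
        h = 31 * h + Long.hashCode(totalTermFreq); // not just the bytes
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermMeta)) {
            return false;
        }
        TermMeta other = (TermMeta) o;
        return docFreq == other.docFreq
            && totalTermFreq == other.totalTermFreq
            && Arrays.equals(metadata, other.metadata);
    }
}
```

Equal hash codes alone are never enough: as noted above, NodeHash still calls equals to make sure two candidates really match before sharing a node.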
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:04 AM:

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M. Here is the bit-width summary for the body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656| 48860170| 43532656|
| 2| 10328824| 13979539| 16200377|
| 3| 2682453| 5032450| 6532755|
| 4| 836109| 2471794| 3134437|
| 5| 262696| 1324704| 1718862|
| 6| 86487| 755797| 990563|
| 7| 29276| 442974| 571996|
| 8| 11257| 263874| 339382|
| 9| 4627| 161402| 205662|
|10| 2060| 102198| 128034|
|11| 979| 63955| 79531|
|12| 386| 39377| 48805|
|13| 170| 24321| 30113|
|14| 65| 14686| 18437|
|15| 10| 9055| 10918|
|16| 2| 5229| 6821|
|17| 0| 2669| 3595|
|18| 0| 1312| 1897|
|19| 0| 696| 914|
|20| 0| 209| 509|
|21| 0| 44| 148|
|22| 0| 4| 38|
|23| 0| 0| 8|
|24| 0| 0| 1|
|...| 0| 0| 0|
|tot| 57778057| 73556459| 73556459|

So we have 66.4% of terms with df==1, and 78.5% with df==ttf. Considering the different bit sizes, the df+ttf encoding saves 57.3MB of 148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader. When we know that df==ttf, we can always be sure that the in-doc freq==1. So for example, when the bit width ranges from 2 to 8 (inclusive), df is not large enough to create ForBlocks, so we have to vInt-encode each in-doc freq. For this 'body' field, I think the index size we can reduce is about 67.5MB (here I only consider the vInt block, since the 1-bit ForBlock is usually small). For all the fields in wikimediumall, we can save 60.8MB of 245.2MB (for df+ttf only). While the vInt frq block we can omit from the PBF is about 95.8MB, I suppose. I'll test this later.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:00 AM:

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M. Here is the bit-width summary for the body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656| 48860170| 43532656|
| 2| 10328824| 13979539| 16200377|
| 3| 2682453| 5032450| 6532755|
| 4| 836109| 2471794| 3134437|
| 5| 262696| 1324704| 1718862|
| 6| 86487| 755797| 990563|
| 7| 29276| 442974| 571996|
| 8| 11257| 263874| 339382|
| 9| 4627| 161402| 205662|
|10| 2060| 102198| 128034|
|11| 979| 63955| 79531|
|12| 386| 39377| 48805|
|13| 170| 24321| 30113|
|14| 65| 14686| 18437|
|15| 10| 9055| 10918|
|16| 2| 5229| 6821|
|17| 0| 2669| 3595|
|18| 0| 1312| 1897|
|19| 0| 696| 914|
|20| 0| 209| 509|
|21| 0| 44| 148|
|22| 0| 4| 38|
|23| 0| 0| 8|
|24| 0| 0| 1|
|...| 0| 0| 0|
|tot| 57778057| 73556459| 73556459|

So we have 66.4% of terms with df==1, and 78.5% with df==ttf. Using the following estimation, the old size for (df+ttf) here is 148.7MB. When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB. When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB; thanks Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + ((rownumber == 1) ? 0 : col[3] * vIntByteSize(rownumber))
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader. When we know that df==ttf, we can always be sure that the in-doc freq==1. So for example, when the bit width ranges from 2 to 8 (inclusive), df is not large enough to create ForBlocks, so we have to vInt-encode each in-doc freq. For this 'body' field, -I think the index size we can reduce is about 67.5MB (here I only consider the vInt block, since the 1-bit ForBlock is usually small)- (ah, I forgot we already steal a bit for this case in Lucene41PBF.) I'll test this later.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:02 AM: - I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M: Here is the bit width summary for body field: ||bit||#(df==ttf)||#df||#ttf|| | 1| 43532656 | 48860170| 43532656| | 2| 10328824 | 13979539| 16200377| | 3| 2682453 | 5032450| 6532755| | 4| 836109 | 2471794| 3134437| | 5| 262696 | 1324704| 1718862| | 6| 86487 | 755797| 990563| | 7| 29276 | 442974| 571996| | 8| 11257 | 263874| 339382| | 9| 4627 | 161402| 205662| |10| 2060 | 102198| 128034| |11| 979 | 63955| 79531| |12| 386 | 39377| 48805| |13| 170 | 24321| 30113| |14| 65 | 14686| 18437| |15| 10 | 9055| 10918| |16| 2 | 5229| 6821| |17| 0 | 2669| 3595| |18| 0 | 1312| 1897| |19| 0 | 696| 914| |20| 0 | 209| 509| |21| 0 | 44| 148| |22| 0 | 4| 38| |23| 0 | 0| 8| |24| 0 | 0| 1| |...|0|0|0| |tot|57778057|73556459|73556459| So we have 66.4% docFreq with df==1, and 78.5% with df==ttf. Using following estimation, the old size for (df+ttf) here is 148.7MB. When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB. When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert! {noformat} old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber) new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber) opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * vIntByteSize(rownumber) {noformat} By the way, I am quite lured to omit frq blocks in Luene41PostingsReader. When we know that df==ttf, we can always make sure the in-doc frq==1. So for example, when bit width ranges from 2 to 8(inclusive), since df is not large enough to create ForBlocks, we have to VInt encode each in-doc freq. 
For this 'body' field, -I think the index size we can reduce is about 67.5MB- -(here I only consider the vInt block, since a 1-bit ForBlock is usually small)- (ah, I forgot we already steal a bit for this case in Lucene41PBF.) I'll test this later.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707780#comment-13707780 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 4:48 PM:
Uploaded detailed data for wikimediumall. Oh, sorry, there was an error when I calculated the index size for the df==0 trick: it should be 105MB instead of 70MB. But the real test still differs from the estimation (weird...); the df==0 trick gains similar compression. Index sizes are below (KB):
{noformat}
v0: 13195304
v1 = v0 + flag byte: 12847172
v2 = v1 + steal bit: 12770700
v3 = v1 + zero df:   12780884
{noformat}
Another thing that surprised me: with the same code/config, luceneutil creates different index sizes? I tested the df==0 trick several times on wikimedium1m, and the index size varies from 514MB to 522MB... Does multi-threading affect much here?
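The two packing variants measured above (v2 "steal bit" and v3 "zero df") can be sketched as round-trippable encoders. This is an illustrative Python sketch, not the branch code; it assumes ttf is stored as a delta (ttf - df) whenever it is written at all.

```python
# Illustrative sketch of the two (df, ttf) packing tricks; not Lucene code.
# Assumes ttf is written as a delta (ttf - df) whenever it is stored.

def steal_bit_encode(df, ttf):
    """v2: shift df left by one and use the low bit to flag df == ttf."""
    if df == ttf:
        return (df << 1 | 1,)       # flag set: ttf omitted entirely
    return (df << 1, ttf - df)      # flag clear: ttf delta follows

def steal_bit_decode(values):
    head = values[0]
    df = head >> 1
    return df, (df if head & 1 else df + values[1])

def zero_df_encode(df, ttf):
    """v3: df == 0 never occurs naturally, so reuse it for df == ttf == 1."""
    if df == 1 and ttf == 1:
        return (0,)                 # long-tail term: a single zero byte
    return (df, ttf - df)

def zero_df_decode(values):
    if values[0] == 0:
        return 1, 1
    return values[0], values[0] + values[1]
```

The zero-df variant keeps df at its original bit width for all other terms, which is why the estimation above only saves the ttf bytes of the df==ttf==1 row.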
Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Han Jiang Labels: gsoc2013 Fix For: 4.4 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13706703#comment-13706703 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 1:42 AM:
Uploaded a patch; it is the main part of the changes I committed to branch3069. The picture shows the current impl of outputs (it is fetched from one field in wikimedium5k):
* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)
A single byte flag is used to indicate which of these fields the current outputs maintain; for a PBF with a short byte[], this should be enough. Also, for long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of terms have df == ttf). Since the TermsEnum is totally based on FSTEnum, the performance of the term dict should be similar to MemoryPF. However, for PK tasks, we have to pull the docsEnum from MMap, so this hurts. Following is the performance comparison:
{noformat}
pure TempFST vs.
Lucene41 + Memory(on idField), on wikimediumall
Task              QPS base  StdDev  QPS comp  StdDev  Pct diff
Respell            48.13 (4.4%)   15.38 (1.0%)  -68.0% ( -70% - -65%)
Fuzzy2             51.30 (5.3%)   17.47 (1.3%)  -65.9% ( -68% - -62%)
Fuzzy1             52.24 (4.0%)   18.50 (1.2%)  -64.6% ( -67% - -61%)
Wildcard            9.31 (1.7%)    6.16 (2.2%)  -33.8% ( -37% - -30%)
Prefix3            23.25 (1.8%)   19.00 (2.2%)  -18.3% ( -21% - -14%)
PKLookup          244.92 (3.6%)  225.42 (2.3%)   -8.0% ( -13% -  -2%)
LowTerm           295.88 (5.5%)  293.27 (4.8%)   -0.9% ( -10% -   9%)
HighPhrase         13.62 (6.5%)   13.54 (7.4%)   -0.6% ( -13% -  14%)
MedTerm            99.51 (7.8%)   99.19 (7.7%)   -0.3% ( -14% -  16%)
MedPhrase         154.63 (9.4%)  154.38 (10.1%)  -0.2% ( -17% -  21%)
HighTerm           28.25 (10.7%)  28.25 (10.0%)  -0.0% ( -18% -  23%)
OrHighHigh         16.83 (13.3%)  16.86 (13.1%)   0.2% ( -23% -  30%)
HighSloppyPhrase    9.02 (4.4%)    9.03 (4.5%)    0.2% (  -8% -   9%)
LowPhrase           6.26 (3.4%)    6.27 (4.1%)    0.2% (  -7% -   8%)
OrHighMed          13.73 (13.2%)  13.77 (12.8%)   0.3% ( -22% -  30%)
OrHighLow          25.65 (13.2%)  25.73 (13.0%)   0.3% ( -22% -  30%)
MedSloppyPhrase     6.63 (2.7%)    6.66 (2.7%)    0.5% (  -4% -   6%)
AndHighMed         42.77 (1.8%)   43.13 (1.5%)    0.8% (  -2% -   4%)
LowSloppyPhrase    32.68 (3.0%)   32.96 (2.8%)    0.8% (  -4% -   6%)
AndHighHigh        22.90 (1.2%)   23.18 (0.7%)    1.2% (   0% -   3%)
LowSpanNear        29.30 (2.0%)   29.83 (2.2%)    1.8% (  -2% -   6%)
MedSpanNear         8.39 (2.7%)    8.56 (2.9%)    2.0% (  -3% -   7%)
IntNRQ              3.12 (1.9%)    3.18 (6.7%)    2.1% (  -6% -  10%)
AndHighLow        507.01 (2.4%)  522.10 (2.8%)    3.0% (  -2% -   8%)
HighSpanNear        5.43 (1.8%)    5.60 (2.6%)    3.1% (  -1% -   7%)
{noformat}
{noformat}
pure TempFST vs.
pure Lucene41, on wikimediumall
Task              QPS base  StdDev  QPS comp  StdDev  Pct diff
Respell            49.24 (2.7%)   15.51 (1.0%)  -68.5% ( -70% - -66%)
Fuzzy2             52.01 (4.8%)   17.61 (1.4%)  -66.1% ( -68% - -63%)
Fuzzy1             53.00 (4.0%)   18.62 (1.3%)  -64.9% ( -67% - -62%)
Wildcard            9.37 (1.3%)    6.15 (2.1%)  -34.4% ( -37% - -31%)
Prefix3            23.36 (0.8%)   18.96 (2.1%)  -18.8% ( -21% - -16%)
MedPhrase         155.86 (9.8%)  152.34 (9.7%)   -2.3% ( -19% -  19%)
LowPhrase           6.33 (3.7%)    6.23 (4.0%)   -1.6% (  -8% -   6%)
HighPhrase         13.68 (7.2%)   13.49 (6.8%)   -1.4% ( -14% -  13%)
OrHighMed          13.78 (13.0%)  13.68 (12.7%)  -0.8% ( -23% -  28%)
HighSloppyPhrase    9.14 (5.2%)    9.07 (3.7%)   -0.7% (  -9% -   8%)
OrHighHigh         16.87 (13.3%)  16.76 (12.9%)  -0.6% ( -23% -  29%)
OrHighLow          25.71 (13.1%)
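The flag-byte output layout described above (long[] sortable metadata, byte[] generic metadata, term stats, with ttf inlined when df == ttf) can be sketched as a round-trippable pack/unpack pair. Bit assignments and names here are illustrative guesses, not the branch3069 wire format.

```python
# Hypothetical sketch of a single-flag-byte FST term output; illustrative
# bit assignments, not the actual branch3069 format.

HAS_LONGS = 1 << 0  # sortable long[] metadata present
HAS_BYTES = 1 << 1  # generic byte[] metadata present
HAS_STATS = 1 << 2  # term stats (df, ttf) present
DF_EQ_TTF = 1 << 3  # long-tail term: ttf == df, so only df is stored

def encode(longs, extra, df, ttf):
    """Pack one term's output behind a single flag byte."""
    flag, body = 0, []
    if longs:
        flag |= HAS_LONGS
        body.append(list(longs))
    if extra:
        flag |= HAS_BYTES
        body.append(bytes(extra))
    flag |= HAS_STATS
    if df == ttf:
        flag |= DF_EQ_TTF
        body.append([df])           # one number instead of two
    else:
        body.append([df, ttf])
    return flag, body

def decode(flag, body):
    it = iter(body)
    longs = next(it) if flag & HAS_LONGS else []
    extra = next(it) if flag & HAS_BYTES else b""
    stats = next(it)
    df = stats[0]
    ttf = df if flag & DF_EQ_TTF else stats[1]
    return longs, extra, df, ttf
```

The point of the flag byte is that absent fields cost nothing per term, and the df == ttf case (85.8% of body-field terms in wikimedium1m, per the comment above) drops the second stat entirely.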
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642971#comment-13642971 ] Han Jiang edited comment on LUCENE-3069 at 4/26/13 6:08 PM:
This is my initial proposal for this project: https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/34001 I'm looking forward to your feedback. :)
was (Author: billy): This is my initial proposal for this project: https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/billybob/32001 I'm looking forward to your feedback. :)
Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2013 Fix For: 4.3 FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta.