[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2014-01-29 Thread Otis Gospodnetic (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885834#comment-13885834 ]

Otis Gospodnetic commented on LUCENE-2082:
--

[~whzz] are you still working on this by any chance?

 Performance improvement for merging posting lists
 --------------------------------------------------

 Key: LUCENE-2082
 URL: https://issues.apache.org/jira/browse/LUCENE-2082
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Michael Busch
Priority: Minor
  Labels: gsoc2014
 Fix For: 4.7


 A while ago I had an idea about how to improve the merge performance
 for posting lists. This is currently by far the most expensive part of
 segment merging due to all the VInt de-/encoding. I'm not sure whether
 an idea for improving this has already been mentioned in the past.
 The basic idea is to perform a raw copy of as much posting data as
 possible. The reason this is difficult is that we have to remove
 deleted documents. But often the fraction of deleted docs in a segment
 is rather low (10%?), so it's likely that there are quite long
 consecutive sections without any deletions.
 To find these sections we could use the skip lists. Basically, at any
 point during the merge we would find the skip entry before the next
 deleted doc. All entries up to that point can be copied without
 de-/encoding of the VInts. Then, for the section that contains deleted
 docs, we merge the normal way to remove the deletes. Afterwards we
 check again with the skip lists whether we can raw-copy the next section.
 To make this work a few changes are necessary:
 1) Currently the multilevel skiplist reader/writer can only deal with
 fixed-size skips (16 on the lowest level). It would be an easy change
 to allow variable-size skips, but then the MultiLevelSkipListReader
 can't return numSkippedDocs anymore, which SegmentTermDocs needs -
 hence change 2).
 2) Store the last docID in which a term occurred in the term
 dictionary. This would also be beneficial for other use cases. By
 doing that, SegmentTermDocs#next(), #read() and #skipTo() know when
 the end of the posting list is reached. Currently they have to track
 the df, which is why after a skip it's important to take
 numSkippedDocs into account.
 3) Change the merging algorithm according to the description above.
 It's important to create a new skiplist entry at the beginning of
 every block that is copied in raw mode, because the next skip entry's
 values are deltas from the beginning of the block. Also, the very
 first posting, and that one only, needs to be decoded and re-encoded
 to make sure that the payload length is written explicitly (i.e. it
 must not depend on the previous length). Such a skip entry also has to
 be created at the beginning of each source segment's posting list.
 With change 2) we don't have to worry about the positions of the skip
 entries, and having a few extra skip entries in merged segments won't
 hurt much.
 If a segment has no deletions at all this will avoid any
 decoding/encoding of VInts (the best case). I think it will also work
 well for segments with a rather low number of deletions. We should
 probably have a threshold: if the number of deletes exceeds it, we
 fall back to old-style merging.
 I haven't implemented any of this, so there might be complications I
 haven't thought about. Please let me know if you can think of reasons
 why this wouldn't work or if you think more changes are necessary.
 I will probably not have time to work on this soon, but I wanted to
 open this issue to not forget about it :). Anyone should feel free to
 take this!
 Btw: I think the flex-indexing branch would be a great place to try this
 out as a new codec. This would also be good to figure out what APIs
 are needed to make merging fully flexible as well.
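
A minimal sketch of the copy-vs-decode loop described above may make this
concrete. Everything below is hypothetical: PostingsSource, SkipReader and
PostingsWriter are stand-in interfaces, not Lucene APIs, and docID remapping
between source segments (plus the payload-length detail from change 3) is
omitted, since none of this was ever implemented.

{code:java}
import java.io.IOException;
import java.util.BitSet;

// Sketch of the raw-copy merge idea: alternate between copying whole
// skip-delimited blocks byte-for-byte and decoding/re-encoding postings
// around deleted docs. All interfaces are hypothetical stand-ins.
final class RawCopyMergeSketch {

  interface PostingsSource {
    long filePointer();                         // position in the postings file
    boolean decodeNextDoc() throws IOException; // VInt-decode one posting
    int doc();                                  // docID of the last decoded posting
  }

  interface PostingsWriter {
    void writeSkipEntry();                      // change 3): entry at each raw block start
    // Copies [startFP, endFP) verbatim and advances src to endFP.
    void copyRawBytes(PostingsSource src, long startFP, long endFP) throws IOException;
    void reencode(PostingsSource src) throws IOException; // re-encode last decoded posting
  }

  interface SkipReader {
    long seekBefore(int targetDoc);             // file pointer of last skip entry before targetDoc
  }

  /** Merge one segment's postings for one term, dropping deleted docs. */
  static void mergeSegment(PostingsSource src, SkipReader skips, BitSet deleted,
                           int maxDoc, PostingsWriter out) throws IOException {
    int doc = 0;
    while (doc < maxDoc) {
      int nextDelete = deleted.nextSetBit(doc);
      if (nextDelete < 0) nextDelete = maxDoc;

      // 1. Raw-copy everything up to the last skip entry before the next delete.
      long start = src.filePointer();
      long end = skips.seekBefore(nextDelete);
      if (end > start) {
        out.writeSkipEntry();                   // block deltas start here
        out.copyRawBytes(src, start, end);
      }

      // 2. Decode and re-encode the normal way until we are past the delete.
      int last = nextDelete;
      while (src.decodeNextDoc()) {
        if (!deleted.get(src.doc())) out.reencode(src);
        last = src.doc();
        if (last > nextDelete) break;           // clean run ahead; try raw copy again
      }
      doc = Math.max(nextDelete, last) + 1;
    }
  }
}
{code}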





[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2013-06-14 Thread Dmitry Kan (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13683303#comment-13683303 ]

Dmitry Kan commented on LUCENE-2082:


Hi [~whzz],

Would you be potentially interested in another postings-list idea that came up
recently?

http://markmail.org/message/6ro7bbez3v3y5mfx#query:+page:1+mid:tywtrjjcfdbzww6f+state:results

It could have quite a high impact on the index size, and it should hopefully
be relatively easy to start an experiment using the Lucene codec technology.

Just in case you get interested.



[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2013-06-13 Thread Aleksandra Wozniak (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682805#comment-13682805 ]

Aleksandra Wozniak commented on LUCENE-2082:


Additionally, it seems to me that the second change from the description above
is no longer needed, since DocsEnum implementations simply return NO_MORE_DOCS
to indicate that the end of the postings list has been reached.
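
(For context, the Lucene 4.x iteration contract is what makes that possible:
DocsEnum and DocIdSetIterator.NO_MORE_DOCS below are real classes, only the
consume() wrapper is illustrative.)

{code:java}
import java.io.IOException;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.search.DocIdSetIterator;

final class PostingsIteration {
  // The iterator itself signals exhaustion, so callers no longer need to
  // track df / numSkippedDocs the way SegmentTermDocs once did.
  static void consume(DocsEnum postings) throws IOException {
    int doc;
    while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      // process doc (and postings.freq()) here
    }
  }
}
{code}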



[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2013-06-06 Thread Aleksandra Wozniak (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676914#comment-13676914 ]

Aleksandra Wozniak commented on LUCENE-2082:


Michael,

Thank you for your response. It really seems like a lot of work, but hopefully
I can do at least part of it.

After digging a little into the MultiLevelSkipListReader/Writer
implementation, I have a doubt about the first point of the description above:

"Currently the multilevel skiplist reader/writer can only deal with fixed-size
skips (16 on the lowest level). It would be an easy change to allow
variable-size skips"

Do variable-size skips mean that on each skip level we want to allow
different intervals between skip entries? For example, would a skip list like
this be correct?

posting list: doc0-doc1-doc2-doc3-...-doc29 (30 entries)
skip level 0: doc3-doc5-doc10-doc12-doc16-doc21-doc27 (intervals range from
2 to 6)
skip level 1: doc5-doc27 (the 2nd and 7th entries of the previous level)
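
(One way to see why change 1) drags in change 2) -- a toy illustration, not
the real MultiLevelSkipListReader code:)

{code:java}
final class SkipMath {
  // With fixed-size skips the count of skipped docs falls out of arithmetic:
  static int numSkippedDocsFixed(int skipEntryIndex, int skipInterval) {
    return (skipEntryIndex + 1) * skipInterval;   // e.g. skipInterval = 16
  }
  // With variable-size skips (e.g. level 0 = {3, 5, 10, 12, 16, 21, 27})
  // no such formula exists: the count would have to be stored in each entry,
  // or sidestepped by storing each term's last docID (change 2) so the
  // reader never needs the count at all.
}
{code}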


[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2013-04-18 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635788#comment-13635788 ]

Michael McCandless commented on LUCENE-2082:


Hi Aleksandra,

I don't think anyone is working on this now ... it'd be quite a bit of work!

The classes have changed names but the core idea is the same.  Have a look at
PostingsFormat: that's the Codec component that handles reading/writing/merging
of all postings files (terms dict, docs/freqs/positions/offsets).  It seems
like for this issue you'd need to override the Fields/Terms/PostingsConsumer
merge methods.

But some things here will likely require changes outside of the Codec; e.g.
today we always remove deletes while merging, but for this issue it looks like
you may want a threshold below which the deletes are not removed...
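
A hedged sketch of where such a threshold could live; the names and the 10%
figure are invented here (the issue only hedges that deletions are "rather low
(10%?)"), and the real hook would be the Codec/PostingsFormat merge methods
mentioned above:

{code:java}
// Hypothetical policy a merging codec could consult before choosing between
// raw-copy merging and classic decode/re-encode merging.
final class RawCopyPolicy {
  static final double MAX_DELETE_RATIO = 0.10;  // assumed tunable

  static boolean preferRawCopy(int maxDoc, int numDeletedDocs) {
    // Dense deletions leave few clean runs to raw-copy; fall back then.
    return numDeletedDocs <= maxDoc * MAX_DELETE_RATIO;
  }
}
{code}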


[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

2013-04-16 Thread Aleksandra Wozniak (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1363#comment-1363 ]

Aleksandra Wozniak commented on LUCENE-2082:


Is anyone actively working on this improvement? If not, I was wondering if I
could contribute -- I'm a graduate CS student, and for my final project I'm
investigating ways to optimize the merging and index-creation time of text
indices. During my research I recently came across this ticket, and I thought
I'd like to implement this idea.

I took a first look at the source code -- it seems that the codebase has
changed significantly since this issue was created. From what I found in the
documentation (http://lucene.apache.org/core/4_0_0/MIGRATE.html), many classes
mentioned here have changed their names and APIs. Could you please update the
description above or comment on which parts of it are no longer valid?


[jira] Commented: (LUCENE-2082) Performance improvement for merging posting lists

2009-11-21 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781027#action_12781027 ]

Michael McCandless commented on LUCENE-2082:


This sounds like a neat idea!  For building up a new index (no deletions) it 
ought to be a sizable performance gain.

I just committed changes on the flex branch to make it possible for codecs to 
override merging...
