[ https://issues.apache.org/jira/browse/LUCENE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848484#action_12848484 ]
Renaud Delbru commented on LUCENE-2340: --------------------------------------- {quote} The block is "shared" across postings, so a rare posting list in an otherwise big segment should be fine? {quote} Ok, I didn't got that point until now. Small lists will be inlined with bigger lists over a block. However, it means that, most of the time after a seek, you will have to decode a block when only a part of it is of interest (e.g., decode the block, skip the end of the previous posting list, in order to access the start of the posting list you are looking for). But this overhead stays minimal since it will occur only once per query term times the number of posting files. And this overhead is even reduced more if the block compression algorithm is easily "seekable" like for FOR. However, I am not sure about the consequences on algorithms like FOR or PFOR which determine the best compression configuration for a block given its list of integers. Certain blocks could be less well compressed since its list of integers could follow more than one distribution. It could be something interesting to try (or are you aware of any kind of experiments or benchmarks that show the benefits/disadvantages of these two approaches ?). {quote} Small segments will indeed be wasteful, but they'll presumably quickly be merged away. {quote} Yes, they will probably disappear after the first merge. {quote} Would other less-silly impls also need to do this? Ie the thing I want to avoid is foisting onto all block-based codecs the need to track the size of every block... {quote} They would have to do something similar, but a clever implementation could reduce well the overhead. For example, the core algorithm of Simple9, FOR, PFOR are completely compatible. The only changes I had to made is in *IndexInput#readBlock and *IndexOutput#flushBlock. I set the lower bit of numBytes to 0 when the block to write does not have the same size than the previous block. and encode its length (in IndexOutput). Then, in IndexInput, whenever I read a numBytes with the lower bit to 0, I read the length of the block, and pass it to #setUnCompressedData. Since current block algorithms relies on the size of the uncompressed data to terminate their loop, they adapt automatically to the new block size. But, indeed, after your comments, the gain seems very minimal, since an incomplete block will only happen once per segment file. The technique I am using was for the case where a block is not "shared" with other posting list. But, in the current case, this is not very useful and will add a slight overhead for very large segment file (since I need one bit per block * number of blocks for tracking the end of the block list). Thanks for the clarification Michael, and I think you can close the issue, since I am not really sure that my patch is worth keeping given the current way you implemented block-based index. > FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list > -------------------------------------------------------------------------- > > Key: LUCENE-2340 > URL: https://issues.apache.org/jira/browse/LUCENE-2340 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Affects Versions: Flex Branch > Reporter: Renaud Delbru > Priority: Minor > Fix For: Flex Branch > > Attachments: LUCENE-1458-FixedIntBlockIndexOutput.patch, > LUCENE-1458-FixedIntBlockIndexOutput.patch > > > At closing time, the current FixedIntBlockIndexOutput flushes blocks of > blockSize even if there is only a few integers in the block. > This can be problematic and causes a big overhead when using large blockSize > (e.g., 1024), on small segments or on rare term posting list. > One solution will be to have a secondary flushBlock method with an additional > paramter: the valid length of a buffer. This method will be only called in > the FixedIntBlockIndexOutput#close() method. > The way this particular block of integers are encoded are left to subclasses. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org