[
https://issues.apache.org/jira/browse/LUCENE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848484#action_12848484
]
Renaud Delbru commented on LUCENE-2340:
---------------------------------------
{quote}
The block is "shared" across postings, so a rare posting list in an otherwise
big segment should be fine?
{quote}
OK, I had not understood that point until now. Small lists will be inlined with
bigger lists within a block.
However, it means that, most of the time after a seek, you will have to decode
a block when only part of it is of interest (e.g., decode the block, then skip
the end of the previous posting list to reach the start of the posting list
you are looking for). But this overhead stays minimal, since it occurs only
once per query term times the number of posting files. The overhead is reduced
even further if the block compression algorithm is easily "seekable", as FOR
is.
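To make the seek cost concrete, here is a minimal sketch of reading a posting
list that starts in the middle of a shared block. It is not Lucene code: the
class SharedBlockSeek, the tiny BLOCK_SIZE, and the uncompressed decodeBlock
stand-in are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: posting lists packed back to back into fixed-size blocks. */
final class SharedBlockSeek {
    static final int BLOCK_SIZE = 4; // tiny on purpose, for illustration only

    /** Stand-in for a real block decoder (FOR, PFOR, ...): blocks are stored uncompressed here. */
    static int[] decodeBlock(int[] file, int blockIndex) {
        return Arrays.copyOfRange(file, blockIndex * BLOCK_SIZE, (blockIndex + 1) * BLOCK_SIZE);
    }

    /** Reads count values of the posting list that starts at global position start. */
    static int[] readPostings(int[] file, int start, int count) {
        int[] out = new int[count];
        int blockIndex = start / BLOCK_SIZE;
        int offset = start % BLOCK_SIZE; // tail of the previous list: decoded, then skipped
        int[] block = decodeBlock(file, blockIndex);
        for (int i = 0; i < count; i++) {
            if (offset == BLOCK_SIZE) { // the list continues into the next block
                block = decodeBlock(file, ++blockIndex);
                offset = 0;
            }
            out[i] = block[offset++];
        }
        return out;
    }

    public static void main(String[] args) {
        // List A = {1,2,3,4,5,6}, list B = {10,20,30}; the last block is padded with zeros.
        int[] file = {1, 2, 3, 4, 5, 6, 10, 20, 30, 0, 0, 0};
        // B starts at position 6, i.e. two values into the second block.
        System.out.println(Arrays.toString(readPostings(file, 6, 3))); // prints [10, 20, 30]
    }
}
```

Only the first block touched after the seek contains values that are decoded
and thrown away; every later block of the same list is used from its start.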
However, I am not sure about the consequences for algorithms like FOR or PFOR,
which determine the best compression configuration for a block given its list
of integers. Some blocks could compress less well, since their integers could
follow more than one distribution. It could be interesting to try (or are you
aware of any experiments or benchmarks that show the benefits and
disadvantages of these two approaches?).
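As a rough illustration of the concern, the sketch below computes the bit width
a simplified FOR-style encoder would pick for a block, assuming every value in
the block is stored with the width of the largest one. The class ForBitWidth
and the sample deltas are made up for this example; this is not the actual FOR
or PFOR implementation, and PFOR's exceptions are meant to soften exactly this
effect.

```java
/** Hypothetical sketch of how a FOR-style encoder picks one bit width per block. */
final class ForBitWidth {
    /** Bits per value when all (non-negative) values share the width of the largest one. */
    static int bitsPerValue(int[] block) {
        int max = 0;
        for (int v : block) {
            max = Math.max(max, v);
        }
        // Width of the largest value; at least 1 bit so an all-zero block is still encodable.
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    /** Total payload size of the block in bits, ignoring any header. */
    static int packedBits(int[] block) {
        return bitsPerValue(block) * block.length;
    }

    public static void main(String[] args) {
        int[] dense = {1, 2, 1, 3};          // small deltas from a frequent term
        int[] rare = {900, 1000, 700, 950};  // large deltas from a rare term
        int[] mixed = {1, 3, 900, 1000};     // a shared block straddling both lists
        System.out.println(bitsPerValue(dense)); // prints 2
        System.out.println(bitsPerValue(rare));  // prints 10
        System.out.println(bitsPerValue(mixed)); // prints 10
    }
}
```

In the mixed block the two small deltas are stored with 10 bits each instead of
2, which is the kind of loss a benchmark would have to quantify.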
{quote}
Small segments will indeed be wasteful, but they'll presumably quickly be
merged away.
{quote}
Yes, they will probably disappear after the first merge.
{quote}
Would other less-silly impls also need to do this? Ie the thing I want to avoid
is foisting onto all block-based codecs the need to track the size of every
block...
{quote}
They would have to do something similar, but a clever implementation could
keep the overhead low.
For example, the core algorithms of Simple9, FOR, and PFOR are fully
compatible. The only changes I had to make were in *IndexInput#readBlock and
*IndexOutput#flushBlock.
I set the lowest bit of numBytes to 0 when the block to write does not have the
same size as the previous block, and then encode its length (in IndexOutput).
Then, in IndexInput, whenever I read a numBytes with the lowest bit set to 0, I
read the length of the block and pass it to #setUnCompressedData. Since the
current block algorithms rely on the size of the uncompressed data to terminate
their loop, they adapt automatically to the new block size.
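A minimal sketch of that header scheme follows. It is not the code from the
patch: the class BlockHeaderSketch, the use of java.nio.ByteBuffer in place of
IndexInput/IndexOutput, and the fixed 4-byte ints are assumptions chosen to
keep the example self-contained.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a block header whose lowest bit flags a change of block length. */
final class BlockHeaderSketch {
    /** Writes the header for one block of length values that compressed to numBytes bytes. */
    static void writeHeader(ByteBuffer out, int numBytes, int length, int prevLength) {
        if (length == prevLength) {
            out.putInt((numBytes << 1) | 1); // lowest bit 1: same length as the previous block
        } else {
            out.putInt(numBytes << 1);       // lowest bit 0: an explicit length follows
            out.putInt(length);
        }
    }

    /** Reads one header and returns {numBytes, length}. */
    static int[] readHeader(ByteBuffer in, int prevLength) {
        int header = in.getInt();
        int numBytes = header >>> 1;
        int length = (header & 1) == 1 ? prevLength : in.getInt();
        return new int[] {numBytes, length};
    }

    public static void main(String[] args) {
        int blockSize = 1024; // assumed to be known to both writer and reader
        ByteBuffer buf = ByteBuffer.allocate(64);
        writeHeader(buf, 100, blockSize, blockSize); // a full block: 4 header bytes
        writeHeader(buf, 7, 3, blockSize);           // a final block of 3 values: 8 header bytes
        buf.flip();
        int[] first = readHeader(buf, blockSize);
        int[] last = readHeader(buf, first[1]);
        System.out.println(first[1] + " " + last[1]); // prints 1024 3
    }
}
```

The reader hands the decoded length to the block decoder, which then stops
after that many values instead of after a full block.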
But, indeed, after your comments, the gain seems very small, since an
incomplete block will only happen once per segment file. The technique I am
using was meant for the case where a block is not "shared" with other posting
lists. In the current case it is not very useful, and it adds a slight
overhead for very large segment files (since I need one bit per block, for
every block, to track the end of the block list).
Thanks for the clarification, Michael. I think you can close the issue, since
I am not really sure that my patch is worth keeping given the way you have
implemented the block-based index.
> FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list
> --------------------------------------------------------------------------
>
> Key: LUCENE-2340
> URL: https://issues.apache.org/jira/browse/LUCENE-2340
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: Flex Branch
> Reporter: Renaud Delbru
> Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-1458-FixedIntBlockIndexOutput.patch,
> LUCENE-1458-FixedIntBlockIndexOutput.patch
>
>
> At closing time, the current FixedIntBlockIndexOutput flushes blocks of
> blockSize even if there are only a few integers in the block.
> This can be problematic and causes a big overhead when using a large blockSize
> (e.g., 1024), on small segments or on rare term posting lists.
> One solution would be to have a secondary flushBlock method with an additional
> parameter: the valid length of the buffer. This method would only be called in
> the FixedIntBlockIndexOutput#close() method.
> How this particular block of integers is encoded is left to subclasses.