[jira] Commented: (LUCENE-2340) FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list

Renaud Delbru (JIRA) Mon, 22 Mar 2010 18:07:50 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848484#action_12848484
 ]


Renaud Delbru commented on LUCENE-2340:
---------------------------------------

{quote}
The block is "shared" across postings, so a rare posting list in an otherwise 
big segment should be fine?
{quote}

Ok, I didn't got that point until now. Small lists will be inlined with bigger 
lists over a block.
However, it means that, most of the time after a seek, you will have to decode 
a block when only a part of it is of interest (e.g., decode the block, skip the 
end of the previous posting list, in order to access the start of the posting 
list you are looking for). But this overhead stays minimal since it will occur 
only once per query term times the number of posting files. And this overhead 
is even reduced more if the block compression algorithm is easily "seekable" 
like for FOR.
However, I am not sure about the consequences on algorithms like FOR or PFOR 
which determine the best compression configuration for a block given its list 
of integers. Certain blocks could be less well compressed since its list of 
integers could follow more than one distribution. It could be something 
interesting to try (or are you aware of any kind of experiments or benchmarks 
that show the benefits/disadvantages of these two approaches ?).

{quote}
Small segments will indeed be wasteful, but they'll presumably quickly be 
merged away.
{quote}

Yes, they will probably disappear after the first merge.

{quote}
Would other less-silly impls also need to do this? Ie the thing I want to avoid 
is foisting onto all block-based codecs the need to track the size of every 
block...
{quote}

They would have to do something similar, but a clever implementation could 
reduce well the overhead.

For example, the core algorithm of Simple9, FOR, PFOR are completely 
compatible. The only changes I had to made is in *IndexInput#readBlock and 
*IndexOutput#flushBlock.
I set the lower bit of numBytes to 0 when the block to write does not have the 
same size than the previous block. and encode its length (in IndexOutput). 
Then, in IndexInput, whenever I read a numBytes with the lower bit to 0, I read 
the length of the block, and pass it to #setUnCompressedData. Since current 
block algorithms relies on the size of the uncompressed data to terminate their 
loop, they adapt automatically to the new block size.

But, indeed, after your comments, the gain seems very minimal, since an 
incomplete block will only happen once per segment file. The technique I am 
using was for the case where a block is not "shared" with other posting list.
But, in the current case, this is not very useful and will add a slight 
overhead for very large segment file (since I need one bit per block * number 
of blocks for tracking the end of the block list).

Thanks for the clarification Michael, and I think you can close the issue, 
since I am not really sure that my patch is worth keeping given the current way 
you implemented block-based index.

> FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-2340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2340
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Renaud Delbru
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-1458-FixedIntBlockIndexOutput.patch, 
> LUCENE-1458-FixedIntBlockIndexOutput.patch
>
>
> At closing time, the current FixedIntBlockIndexOutput flushes blocks of 
> blockSize even if there is only a few integers in the block.
> This can be problematic and causes a big overhead when using large blockSize 
> (e.g., 1024), on small segments or on rare term posting list. 
> One solution will be to have a secondary flushBlock method with an additional 
> paramter: the valid length of a buffer. This method will be only called in 
> the FixedIntBlockIndexOutput#close() method.
> The way this particular block of integers are encoded are left to subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2340) FixedIntBlockIndexOutput encodes unnecessary integers at the end of a list

Reply via email to