[
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990662#comment-12990662
]
Robert Muir edited comment on LUCENE-2905 at 2/11/11 11:58 AM:
---------------------------------------------------------------
Renaud, thanks for the paper... I will spend some time trying to digest it!
But I think it's always an option to try to reduce the number of files, too.
This also matters for the number of open files and other practical reasons.
Mike, a few questions:
{quote}
If we are REALLY sure keeping int alignment in these intblock
encoded files is not important (ie, we really do get best perf by
slurping in byte[] and then decoding from there), then we should also
store eg skip data into the frq/doc file (this is what Standard
does).
{quote}
Well, I measured this a lot, but why box ourselves out? As a first step
we can still keep the .skp file as-is, but it only needs to point to
the start of the doc block in the frq/doc file.
{quote}
Maybe similarly interleave payload/positions packets?
{quote}
I think we should do something here. But when I started trying to draw
this up, I came to the conclusion that payload byte[]s should themselves
be actual terms (e.g. deduplicated), and we store some sort of ord to get
to them instead of bytes and length, etc. If they were themselves terms,
then they could also store forward postings back to their docs, and you
could query on payloads (attributes) efficiently too... but I know this
would be a fairly large change.
{quote}
Separately, I think we should break out "when skip is even
stored" vs "how frequently we index skip data".
{quote}
I agree, that's another reason to make skipInterval completely codec-private.
Then a codec could itself have a separate "skipMinimum" too.
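To make the split concrete, here is a minimal sketch of what a codec-private policy might look like. SkipPolicy and its methods are hypothetical names for illustration, not existing Lucene classes; skipMinimum decides "when skip is even stored", and skipInterval decides "how frequently we index skip data".

```java
/** Hypothetical codec-private skip policy; not part of any existing Lucene API. */
final class SkipPolicy {
    private final int skipInterval; // one skip entry per this many docs
    private final int skipMinimum;  // terms with a lower docFreq store no skip data

    SkipPolicy(int skipInterval, int skipMinimum) {
        if (skipInterval < 1 || skipMinimum < 0) {
            throw new IllegalArgumentException("skipInterval must be >= 1 and skipMinimum >= 0");
        }
        this.skipInterval = skipInterval;
        this.skipMinimum = skipMinimum;
    }

    /** "When skip is even stored": only terms with enough postings get skip data. */
    boolean storesSkipData(int docFreq) {
        return docFreq >= skipMinimum;
    }

    /** "How frequently we index skip data": number of skip entries for one term. */
    int skipEntryCount(int docFreq) {
        return storesSkipData(docFreq) ? docFreq / skipInterval : 0;
    }
}
```

With a policy like this, a codec could raise skipMinimum to drop skip data for short postings lists without touching the interval used for long ones.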
{quote}
For low DF terms w/in a block I think we shouldn't store their
pointers into the posting; instead, you should load an earlier term's
postings and scan over its postings. This should save tons of space
in the tib file.
{quote}
How would this work? Isn't everything in the .tib currently delta-encoded
against the index term? What if there are 'large' terms in between?
And for some queries like rangequery, wouldn't this create a little O(n^2)
of sorts? I don't think this is a big deal, since most people should be using
e.g. NumericRangeQuery, and maybe we could still prevent it...?
> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
> Key: LUCENE-2905
> URL: https://issues.apache.org/jira/browse/LUCENE-2905
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Robert Muir
> Fix For: Bulk Postings branch
>
> Attachments: LUCENE-2905_simple64.patch,
> LUCENE-2905_skipIntervalMin.patch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable
> Intblock encodings, we have problems with both performance and index size
> versus StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
>   frq:            1,862,174,204 bytes
>   prx:            1,146,898,936 bytes
>   tib:              541,128,354 bytes
>   complete index: 4,321,032,720 bytes
> bulkvint:
>   doc:            1,297,215,588 bytes
>   frq:              725,060,776 bytes
>   pos:            1,163,335,609 bytes
>   tib:              729,019,637 bytes
>   complete index: 5,180,088,695 bytes
> simple64:
>   doc:            1,260,869,240 bytes
>   frq:              234,491,576 bytes
>   pos:            1,055,024,224 bytes
>   skp:              473,293,042 bytes
>   tib:              725,928,817 bytes
>   complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .frq).
> * Having to store both a relative delta to the block start, and an offset
> into the block.
> * In a lot of cases various numbers involved are larger than they should be:
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try
> to fix this:
> Is Sep really necessary? Should we instead make an alternative to Sep,
> "Interleaved", that interleaves doc and freq blocks (doc,freq,doc,freq) into
> one file? The concrete impl could implement skipBlock() for when callers only
> want docdeltas: e.g. for Simple64, blocks on disk are fixed size so it could
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read
> their single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such
> as 64 or 128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but
> instead completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to
> the start of a block? Then their skip pointers don't need to be a combination
> of delta + upto, because upto is always zero. What would we have to modify in
> the bulkpostings api for jump() to work with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based
> on block size? (Maybe by default it's 1, meaning we can skip to the start of
> every block.)
> For codecs like Simple64 that encode fixed length frames, shouldn't we use
> 'blockid' instead of file pointer so that we get smaller numbers? e.g.
> simple64 can do blockid * 8 to get to the file pointer.
> Going along with the blockid concept, couldn't pointers in the terms dict be
> blockid deltas from the index term, instead of fp deltas? These would be
> smaller numbers and we could compress this metadata better.
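
As a rough illustration of why blockid deltas would be smaller, the sketch below compares the encoded size of a blockid delta against the equivalent file pointer delta. It assumes 8-byte Simple64 blocks (the "blockid * 8" above) and a 7-bits-per-byte variable-length encoding in the style of Lucene's VInt/VLong; BlockIdPointers and its methods are hypothetical names, not existing API.

```java
/** Hypothetical helper comparing blockid deltas with file pointer deltas. */
final class BlockIdPointers {
    /** Assumption: each Simple64 block is one 64-bit word, i.e. 8 bytes on disk. */
    static final int SIMPLE64_BLOCK_BYTES = 8;

    /** Recovers the file pointer from a block id: blockid * 8. */
    static long filePointer(long blockId) {
        return blockId * SIMPLE64_BLOCK_BYTES;
    }

    /** Bytes needed to write a non-negative value with 7 payload bits per byte. */
    static int vLongBytes(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long blockIdDelta = 100;                   // distance in blocks from the index term
        long fpDelta = filePointer(blockIdDelta);  // the same distance in bytes: 800
        System.out.println("blockid delta bytes: " + vLongBytes(blockIdDelta));
        System.out.println("fp delta bytes:      " + vLongBytes(fpDelta));
    }
}
```

Multiplying by 8 costs three extra bits per pointer, so any delta within three bits of a 7-bit boundary spills into an additional byte when stored as a file pointer.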