[
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4599:
---------------------------------
Attachment: LUCENE-4599.patch
New patch with tests, addProx and specialized merging. I think it is ready.
This patch is similar to the previous ones, with two differences. First,
instead of compressing the raw term bytes, it applies LZ4 on top of "prefix
compression" (similar to Lucene40TermVectorsFormat, which writes the length of
the prefix shared with the previous term as a VInt before each term), which
improves the compression ratio. Second, it relies on LUCENE-4643 for most
integer encoding instead of raw packed ints. Otherwise:
- vectors are still compressed into blocks of 16 KB,
- looking up term vectors requires at most 1 disk seek.
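For readers unfamiliar with the prefix compression mentioned above, here is a
minimal sketch in Java. It is not code from the patch: the class and method
names are hypothetical, the layout (prefix length, suffix length, suffix bytes)
is an assumed simplification, and the LZ4 pass that the patch applies on top is
left out because the JDK does not ship an LZ4 implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the encoding used by the LUCENE-4599 patch.
final class PrefixCodingSketch {

    // Writes an int with 7 payload bits per byte; the high bit marks "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads a VInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] data, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Number of leading bytes that a and b have in common.
    static int sharedPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // Assumed layout per term: VInt(prefix length), VInt(suffix length), suffix bytes.
    static byte[] encode(List<String> sortedTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] previous = new byte[0];
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            int prefix = sharedPrefixLength(previous, current);
            int suffix = current.length - prefix;
            writeVInt(out, prefix);
            writeVInt(out, suffix);
            out.write(current, prefix, suffix);
            previous = current;
        }
        return out.toByteArray();
    }

    // Rebuilds each term from the previous term's prefix plus the stored suffix.
    static List<String> decode(byte[] data) {
        List<String> terms = new ArrayList<>();
        byte[] previous = new byte[0];
        int[] pos = {0};
        while (pos[0] < data.length) {
            int prefix = readVInt(data, pos);
            int suffix = readVInt(data, pos);
            byte[] current = new byte[prefix + suffix];
            System.arraycopy(previous, 0, current, 0, prefix);
            System.arraycopy(data, pos[0], current, prefix, suffix);
            pos[0] += suffix;
            terms.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("search", "searcher", "searching", "seek");
        byte[] encoded = encode(terms);
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("round trip:    " + decode(encoded));
    }
}
```

On these four sample terms the prefix-coded form takes 21 bytes, against 27
bytes for the raw term bytes alone (before counting any length headers). Terms
are sorted within a field, so neighbouring terms often share long prefixes,
which is why this step helps before a general-purpose compressor runs.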
Here are the size reductions of the term vector files depending on the size of
the input docs:
|| Field options / Document size || 1 KB (a few tens of docs per chunk) || 750 KB (one doc per chunk) ||
| none | 37% | 32% |
| positions | 32% | 10% |
| offsets | 41% | 31% |
| positions+offsets | 40% | 35% |
Regarding speed, indexing seems to be slightly slower, but the smaller vector
files might make merging faster when not everything fits in the I/O cache. I
also ran a simple benchmark that loads term vectors for every doc of the index
and iterates over all terms and positions. This new format was ~5x slower for
small docs (likely because it has to decode the whole chunk even to read a
single doc) and between 1.5x and 2x faster for large docs that are alone in
their chunk (again, results would very likely be better on a large index which
wouldn't fully fit in the O/S cache).
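To make the chunk-decoding cost concrete, here is a rough Java sketch. It
assumes a chunk is closed as soon as the buffered vector data reaches 16 KB;
that flush rule, the class name and the per-doc sizes are illustrative
assumptions, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of chunking; the real writer in the patch may differ.
final class ChunkingSketch {

    static final int CHUNK_SIZE = 16 * 1024; // 16 KB blocks, as stated above

    // Groups consecutive docs into chunks; a chunk is closed once it holds
    // at least CHUNK_SIZE bytes of (uncompressed) term vector data.
    static List<List<Integer>> chunk(int[] docSizes) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int buffered = 0;
        for (int size : docSizes) {
            current.add(size);
            buffered += size;
            if (buffered >= CHUNK_SIZE) {
                chunks.add(current);
                current = new ArrayList<>();
                buffered = 0;
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    // Bytes that must be decoded to read one doc: the size of its whole chunk.
    static int bytesDecodedFor(int docIndex, int[] docSizes) {
        int firstDocOfChunk = 0;
        for (List<Integer> c : chunk(docSizes)) {
            if (docIndex < firstDocOfChunk + c.size()) {
                int total = 0;
                for (int s : c) {
                    total += s;
                }
                return total;
            }
            firstDocOfChunk += c.size();
        }
        throw new IndexOutOfBoundsException("no such doc: " + docIndex);
    }

    public static void main(String[] args) {
        int[] smallDocs = new int[64];
        java.util.Arrays.fill(smallDocs, 512); // assumed 512 bytes of vectors per doc
        System.out.println("small doc decodes " + bytesDecodedFor(0, smallDocs) + " bytes");

        int[] largeDoc = {750 * 1024}; // one large doc, alone in its chunk
        System.out.println("large doc decodes " + bytesDecodedFor(0, largeDoc) + " bytes");
    }
}
```

Under these assumptions, reading a single 512-byte entry means decoding a full
16 KB chunk (32 docs), whereas a large doc that sits alone in its chunk decodes
nothing it does not need.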
If someone with very large term vector files wanted to test this new format,
it would be great! I'll try on my side to perform more indexing/highlighting
benchmarks.
> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.