[jira] [Updated] (LUCENE-4599) Compressed term vectors

Adrien Grand (JIRA) Fri, 18 Jan 2013 16:52:14 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-4599:
---------------------------------

    Attachment: LUCENE-4599.patch

New patch with tests, addProx and specialized merging. I think it is ready. 
This patch is similar to the previous ones, with two changes. First, instead of 
compressing the raw term bytes, it applies LZ4 compression on top of "prefix 
compression" (similar to Lucene40TermVectorsFormat, which writes the length of 
the prefix shared with the previous term as a VInt before each term) to improve 
the compression ratio. Second, it relies on LUCENE-4643 for most integer 
encoding instead of raw packed ints. Otherwise:
 - vectors are still compressed into blocks of 16 KB,
 - looking up term vectors requires at most 1 disk seek.
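
As a rough illustration of the "prefix compression" step described above (this 
is not the code from the patch), here is a minimal Java sketch. The class and 
method names are hypothetical, and writing the suffix length as a second VInt 
is an assumption about the layout; the source only states that the shared 
prefix length is written as a VInt before each term. The LZ4 pass that the 
patch applies on top of this output is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: prefix-codes a sorted list of terms.
public class PrefixCodingSketch {

    // Variable-length int: 7 data bits per byte, low-order bits first,
    // high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads a VInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] data, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Number of leading bytes that a and b have in common.
    static int sharedPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // For each term: VInt(shared prefix length), VInt(suffix length), suffix bytes.
    // The suffix-length VInt is an assumption made for this sketch.
    static byte[] encode(List<String> sortedTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] previous = new byte[0];
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            int prefix = sharedPrefixLength(previous, current);
            int suffix = current.length - prefix;
            writeVInt(out, prefix);
            writeVInt(out, suffix);
            out.write(current, prefix, suffix);
            previous = current;
        }
        return out.toByteArray();
    }

    // Rebuilds each term from the previous term's prefix plus the stored suffix.
    static List<String> decode(byte[] data) {
        List<String> terms = new ArrayList<>();
        byte[] previous = new byte[0];
        int[] pos = {0};
        while (pos[0] < data.length) {
            int prefix = readVInt(data, pos);
            int suffix = readVInt(data, pos);
            byte[] current = new byte[prefix + suffix];
            System.arraycopy(previous, 0, current, 0, prefix);
            System.arraycopy(data, pos[0], current, prefix, suffix);
            pos[0] += suffix;
            terms.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "lucid", "lucky");
        byte[] encoded = encode(terms);
        System.out.println(encoded.length + " bytes, decoded: " + decode(encoded));
    }
}
```

The gain comes from sorted terms sharing long prefixes; the output is then a 
good candidate for a general-purpose compressor such as LZ4.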

Here are the size reductions of the term vector files depending on the size of 
the input docs:

|| Field options / Document size || 1 KB (a few tens of docs per chunk) || 750 KB (one doc per chunk) ||
| none | 37% | 32% |
| positions | 32% | 10% |
| offsets | 41% | 31% |
| positions+offsets | 40% | 35% |

Regarding speed, indexing seems to be slightly slower, but the smaller vector 
files might make merging faster when not everything fits in the I/O cache. I 
also ran a simple benchmark that loads the term vectors of every doc in the 
index and iterates over all terms and positions. The new format was ~5x slower 
for small docs (likely because it has to decode the whole chunk even to read a 
single doc) and between 1.5x and 2x faster for large docs that are alone in 
their chunk (again, results would very likely be better on a large index that 
doesn't fully fit in the O/S cache).
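
To make the small-doc slowdown concrete, here is a hedged Java sketch of 
chunked storage (again, not the patch itself). All names are hypothetical. Two 
things are assumptions: the flush rule (a chunk is closed once at least 16 KB 
are buffered) and the compressor, where java.util.zip.Deflater stands in for 
LZ4 only because it ships with the JDK. The point is that get() must 
decompress the entire chunk before it can slice out one document.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: documents are buffered and compressed in chunks.
public class ChunkedStoreSketch {
    static final int CHUNK_SIZE = 16 * 1024; // assumed flush threshold

    // One compressed chunk plus the raw byte length of each doc inside it.
    static final class Chunk {
        final byte[] compressed;
        final int rawLength;
        final int[] docLengths;

        Chunk(byte[] compressed, int rawLength, int[] docLengths) {
            this.compressed = compressed;
            this.rawLength = rawLength;
            this.docLengths = docLengths;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();
    private final List<int[]> docIndex = new ArrayList<>(); // docId -> {chunk, slot}
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final List<Integer> pendingLengths = new ArrayList<>();

    // Buffers a doc; closes the chunk once at least CHUNK_SIZE bytes are pending.
    int add(byte[] doc) {
        int docId = docIndex.size();
        docIndex.add(new int[] {chunks.size(), pendingLengths.size()});
        pending.write(doc, 0, doc.length);
        pendingLengths.add(doc.length);
        if (pending.size() >= CHUNK_SIZE) {
            flush();
        }
        return docId;
    }

    // Compresses all pending docs together as one chunk.
    void flush() {
        if (pendingLengths.isEmpty()) {
            return;
        }
        byte[] raw = pending.toByteArray();
        Deflater deflater = new Deflater(); // stand-in for LZ4
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        int[] lengths = pendingLengths.stream().mapToInt(Integer::intValue).toArray();
        chunks.add(new Chunk(out.toByteArray(), raw.length, lengths));
        pending.reset();
        pendingLengths.clear();
    }

    int chunkCount() {
        return chunks.size();
    }

    // Decompresses the WHOLE chunk, then copies out the requested doc.
    byte[] get(int docId) {
        flush(); // make sure the doc's chunk has been written
        int[] entry = docIndex.get(docId);
        Chunk chunk = chunks.get(entry[0]);
        byte[] raw = new byte[chunk.rawLength];
        Inflater inflater = new Inflater();
        inflater.setInput(chunk.compressed);
        try {
            int n = 0;
            while (n < raw.length) {
                int r = inflater.inflate(raw, n, raw.length - n);
                if (r == 0 && (inflater.finished() || inflater.needsInput())) {
                    break;
                }
                n += r;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        int offset = 0;
        for (int i = 0; i < entry[1]; i++) {
            offset += chunk.docLengths[i];
        }
        return Arrays.copyOfRange(raw, offset, offset + chunk.docLengths[entry[1]]);
    }

    public static void main(String[] args) {
        ChunkedStoreSketch store = new ChunkedStoreSketch();
        for (int i = 0; i < 40; i++) {
            store.add(new byte[1024]); // small docs share chunks
        }
        store.flush();
        System.out.println("chunks: " + store.chunkCount());
    }
}
```

With this layout a large doc ends up alone in its chunk, so nothing extra is 
decoded when it is read, while a 1 KB doc pays for decoding its neighbours too.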

If someone with very large term vector files could test this new format, that 
would be great! On my side, I'll try to run more indexing/highlighting 
benchmarks.
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.2
>
>         Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.
