[ 
https://issues.apache.org/jira/browse/LUCENE-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4512:
---------------------------------

    Attachment: LUCENE-4512.patch

I did some tests with the 1K docs from the Wikipedia dump:
 - always 16 or 17 bits per value (bpv) for start pointers (my intuition was
wrong! :-))
 - the CompressingStoredFieldsIndex instance is 185.3KB (measured with 
RamUsageEstimator) for 1M docs (0.19 bytes per doc, 3.24 bytes per chunk).

I tried some other block sizes:
 - 256: 189.2KB
 - 4096: 204.8KB

1024 looks like a good setting.

bq. I was just thinking simpler code in the reader.

Hmm good point. It is true that it is already complex enough... Here is a new 
patch.

bq. And once you get all this baked in aren't you itching to do the vectors 
files too?

I started thinking about it, but I'm not very familiar with the term vectors
file formats yet. There are probably other places that might benefit from 
compression (terms dictionary?).
                
> Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4512
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4512
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4512.patch, LUCENE-4512.patch
>
>
> Robert had a great idea to save memory with 
> {{CompressingStoredFieldsIndex.MEMORY_CHUNK}}: instead of storing the 
> absolute start pointers we could compute the mean number of bytes per chunk 
> of documents and only store the delta between the actual value and the 
> expected value (avgChunkBytes * chunkNumber).
> By applying this idea to every n(=1024?) chunks, we would even:
>  - make sure to never hit the worst case (delta ~= maxStartPointer)
>  - reduce memory usage at indexing time.
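
For illustration, the idea in the description above could be sketched roughly
as follows. This is a hypothetical sketch and not the code from the attached
patch: the class and method names are made up, each block is assumed to keep
its own base pointer next to avgChunkBytes, and zig-zag encoding is assumed as
the way to store negative deltas as small unsigned values.

```java
// Hypothetical sketch of per-block "delta from expected" start pointers.
// Not the code from LUCENE-4512.patch; all names here are illustrative.
final class DeltaFromAverageSketch {

    /** One encoded block of chunk start pointers. */
    static final class Block {
        final long base;          // start pointer of the first chunk in the block
        final long avgChunkBytes; // mean bytes per chunk in the block, rounded down
        final long[] deltas;      // zig-zag encoded (actual - expected), one per chunk

        Block(long base, long avgChunkBytes, long[] deltas) {
            this.base = base;
            this.avgChunkBytes = avgChunkBytes;
            this.deltas = deltas;
        }

        /** Bits per value needed to store the largest encoded delta of the block. */
        int bitsPerValue() {
            long or = 0;
            for (long d : deltas) {
                or |= d; // the OR has the same highest set bit as the maximum
            }
            return Math.max(1, 64 - Long.numberOfLeadingZeros(or));
        }

        /** Rebuilds the absolute start pointer of the i-th chunk of the block. */
        long startPointer(int i) {
            return base + avgChunkBytes * i + zigZagDecode(deltas[i]);
        }
    }

    static long zigZagEncode(long v) {
        return (v << 1) ^ (v >> 63);
    }

    static long zigZagDecode(long v) {
        return (v >>> 1) ^ -(v & 1);
    }

    /** Encodes startPointers[from..to) as a single block. */
    static Block encode(long[] startPointers, int from, int to) {
        int count = to - from;
        long base = startPointers[from];
        long avg = count > 1 ? (startPointers[to - 1] - base) / (count - 1) : 0;
        long[] deltas = new long[count];
        for (int i = 0; i < count; i++) {
            long expected = base + avg * i; // expected value for chunk i
            deltas[i] = zigZagEncode(startPointers[from + i] - expected);
        }
        return new Block(base, avg, deltas);
    }

    public static void main(String[] args) {
        long[] pointers = {1000L, 1100L, 1195L, 1310L, 1400L};
        Block block = encode(pointers, 0, pointers.length);
        System.out.println("bits per value: " + block.bitsPerValue());
        for (int i = 0; i < pointers.length; i++) {
            System.out.println(pointers[i] + " -> " + block.startPointer(i));
        }
    }
}
```

Restarting the average every n chunks keeps each delta bounded by how far a
block drifts from its own mean, which is why the worst case mentioned above
would no longer apply.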

