[
https://issues.apache.org/jira/browse/LUCENE-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4512:
---------------------------------
Attachment: LUCENE-4512.patch
I did some tests with the 1K docs from the wikipedia dump:
- always 16 or 17 bits per value (bpv) for start pointers (my intuition was
wrong! :-))
- the CompressingStoredFieldsIndex instance is 185.3KB (measured with
RamUsageEstimator) for 1M docs (0.19 bytes per doc, 3.24 bytes per chunk).
I tried some other block sizes:
- 256 : 189.2KB
- 4096 : 204.8KB
1024 looks like a good setting.
bq. I was just thinking simpler code in the reader.
Hmm good point. It is true that it is already complex enough... Here is a new
patch.
bq. And once you get all this baked in aren't you itching to do the vectors
files too?
I started thinking about it, but I'm not very familiar with the term vectors
file formats yet. There are probably other places that might benefit from
compression (terms dictionary?).
> Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
> ----------------------------------------------------------------------
>
> Key: LUCENE-4512
> URL: https://issues.apache.org/jira/browse/LUCENE-4512
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-4512.patch, LUCENE-4512.patch
>
>
> Robert had a great idea to save memory with
> {{CompressingStoredFieldsIndex.MEMORY_CHUNK}}: instead of storing the
> absolute start pointers we could compute the mean number of bytes per chunk
> of documents and only store the delta between the actual value and the
> expected value (avgChunkBytes * chunkNumber).
> By applying this idea to every n(=1024?) chunks, we would even:
> - make sure to never hit the worst case (delta ~= maxStartPointer)
> - reduce memory usage at indexing time.
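
For readers skimming the description above, here is a rough sketch of the
delta idea for a single block of chunks. It is not the code from the attached
patch: the class and method names are hypothetical, the expected value is
taken relative to the first start pointer of the block, and the average is
computed with integer division. A real implementation would additionally
bit-pack the deltas.

```java
import java.util.Arrays;

// Illustrative only: names and layout are assumptions, not Lucene's API.
class DeltaStartPointers {

    // Mean number of bytes per chunk over the block [from, to).
    static long averageChunkBytes(long[] startPointers, int from, int to) {
        int n = to - from;
        if (n <= 1) {
            return 0L;
        }
        return (startPointers[to - 1] - startPointers[from]) / (n - 1);
    }

    // Stores, for each chunk of the block, the difference between the
    // actual start pointer and the expected one (base + avg * chunkNumber).
    static long[] encodeBlock(long[] startPointers, int from, int to, long avgChunkBytes) {
        long base = startPointers[from];
        long[] deltas = new long[to - from];
        for (int i = from; i < to; i++) {
            long expected = base + avgChunkBytes * (i - from);
            deltas[i - from] = startPointers[i] - expected;
        }
        return deltas;
    }

    // Rebuilds the absolute start pointer of chunk number idx in the block.
    static long decode(long base, long avgChunkBytes, long[] deltas, int idx) {
        return base + avgChunkBytes * idx + deltas[idx];
    }

    public static void main(String[] args) {
        long[] startPointers = {0, 100, 210, 290, 405};
        long avg = averageChunkBytes(startPointers, 0, startPointers.length);
        long[] deltas = encodeBlock(startPointers, 0, startPointers.length, avg);
        System.out.println("avg=" + avg + " deltas=" + Arrays.toString(deltas));
    }
}
```

Because the deltas stay close to zero while the absolute pointers keep
growing, they need fewer bits per value, and restarting the base and the
average every n chunks keeps one unusual region of the file from inflating
the deltas everywhere else.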