[ 
https://issues.apache.org/jira/browse/LUCENE-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487181#comment-13487181
 ] 

Robert Muir commented on LUCENE-4512:
-------------------------------------

{quote}
That is good but I was expecting the distance from average (128kb here) to be 
less than the chunk size (16kb), which is clearly not the case. Is there 
anything in the dataset that could explain why chunk sizes vary so much? Or 
maybe we should just decrease the block size or the average is wrongly 
computed...
{quote}

Probably: I bet rows from the same country, and even provinces within a country, 
are typically grouped together.
Though before this JIRA issue, I did experiments randomizing the dataset with 
sort -r and it didn't make much difference...

In any case, you can get it from 
http://download.geonames.org/export/dump/allCountries.zip
It's UTF-8 and you can parse it with split("\t").
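
For anyone who wants to reproduce this, here is a minimal sketch of reading the dump line by line as UTF-8 and splitting each row on tabs. It assumes the archive has already been unzipped to a local text file whose path is passed as the first argument; the class and method names are hypothetical. The only deviation from a plain split("\t") is the -1 limit, which keeps trailing empty columns that Java would otherwise drop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class GeoNamesRows {

  // Splits one tab-separated row; the -1 limit preserves trailing empty columns.
  static String[] parseRow(String line) {
    return line.split("\t", -1);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: GeoNamesRows <path to unzipped dump>");
      return;
    }
    Path path = Paths.get(args[0]);
    long rows = 0;
    int maxColumns = 0;
    try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] columns = parseRow(line);
        maxColumns = Math.max(maxColumns, columns.length);
        rows++;
      }
    }
    System.out.println(rows + " rows, at most " + maxColumns + " columns per row");
  }
}
```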

{quote}
Good question. Encoding deltas currently requires 14 or 15 bits per value 
(because it can grow a little larger than the chunk size which is 2^14) so it 
is still a little more compact, and it is less prone to worst cases I think? 
There is some overhead at read time to build the packed ints array instead of 
just deserializing it but I think this is negligible. If we manage to make bpvs 
smaller than 14 on "standard" datasets then I think it makes sense.
{quote}

Well, I wasn't really thinking about saving a few bits on disk... if we want 
that, LZ4 this "metadata stuff" too (just kidding!).

I was just thinking of simpler code in the reader.
                
> Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4512
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4512
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4512.patch
>
>
> Robert had a great idea to save memory with 
> {{CompressingStoredFieldsIndex.MEMORY_CHUNK}}: instead of storing the 
> absolute start pointers we could compute the mean number of bytes per chunk 
> of documents and only store the delta between the actual value and the 
> expected value (avgChunkBytes * chunkNumber).
> By applying this idea to every n(=1024?) chunks, we would even:
>  - make sure to never hit the worst case (delta ~= maxStartPointer)
>  - reduce memory usage at indexing time.
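
The idea in the description above can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the class and method names are hypothetical, the average is derived from the first and last start pointers of a block, the first pointer is kept as a base offset, and the signed deltas are zigzag-encoded so that small negative values also fit in few bits. All three of those choices are assumptions.

```java
// Hypothetical sketch of "store the delta from the expected start pointer".
final class AvgDeltaPointers {

  // Mean number of bytes per chunk over one block of start pointers.
  static long avgChunkBytes(long[] startPointers) {
    int n = startPointers.length;
    return n <= 1 ? 0 : (startPointers[n - 1] - startPointers[0]) / (n - 1);
  }

  // delta[i] = actual start pointer - expected start pointer (base + avg * i).
  static long[] encode(long[] startPointers, long avg) {
    long[] deltas = new long[startPointers.length];
    for (int i = 0; i < startPointers.length; i++) {
      deltas[i] = startPointers[i] - (startPointers[0] + avg * i);
    }
    return deltas;
  }

  // Rebuilds the absolute start pointer of chunk i from the stored delta.
  static long decode(long base, long avg, long[] deltas, int i) {
    return base + avg * i + deltas[i];
  }

  // Bits per value needed to store every delta after zigzag encoding.
  static int bitsRequired(long[] deltas) {
    long bits = 0;
    for (long delta : deltas) {
      bits |= (delta << 1) ^ (delta >> 63); // zigzag: maps signed to unsigned
    }
    return bits == 0 ? 0 : 64 - Long.numberOfLeadingZeros(bits);
  }

  public static void main(String[] args) {
    long[] startPointers = {0, 16000, 33000, 48500, 64000};
    long avg = avgChunkBytes(startPointers);
    long[] deltas = encode(startPointers, avg);
    System.out.println("avgChunkBytes=" + avg + ", bitsPerValue=" + bitsRequired(deltas));
    System.out.println("chunk 3 starts at " + decode(startPointers[0], avg, deltas, 3));
  }
}
```

Restarting the computation every n chunks, as the description suggests, would bound how far a delta can drift from the average within a block.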
