[
https://issues.apache.org/jira/browse/LUCENE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206494#comment-15206494
]
Michael McCandless commented on LUCENE-7122:
--------------------------------------------
I think people may be underestimating the priority of this issue:
{{OfflineSorter}}, now suddenly heavily used by Lucene's new dimensional
points, is like a baby Lucene: it pulls values into heap, up until its budget,
and then sorts them and writes another segment, having to merge them all in the
end. I have watched it go through its file slowly while indexing 3.2B OSM
points ;)
This patch would mean we can store 33% more {{IntPoint}}s in heap before
writing a segment, which is an amazing improvement, especially when it can mean
0 vs 1 merges needed, or 1 vs 2 merges needed, etc., for many use cases. No
matter how fast your SSD is, having to do 1 instead of 2 merges is a big win.
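To make the arithmetic concrete, here is a rough sketch in Java. The two
figures in this message (4 bytes saved per value, 33% more values in heap) are
consistent with a per-value heap cost dropping from 16 to 12 bytes; that
breakdown is an inference, not a number stated on this issue. The 1 GiB heap
budget, the merge factor of 40, and the class and method names are likewise
hypothetical, and the merge model is a simplification rather than
{{OfflineSorter}}'s actual logic.

```java
// Hypothetical model: how per-value heap cost changes the number of sorted
// partitions written, and the number of merge passes needed afterwards.
final class PartitionEstimate {

    // Partitions written when each one holds as many values as fit in the budget.
    static long partitions(long totalValues, long heapBudgetBytes, int bytesPerValue) {
        long valuesPerPartition = heapBudgetBytes / bytesPerValue;
        return (totalValues + valuesPerPartition - 1) / valuesPerPartition; // ceiling
    }

    // Simplified multi-pass merge: each pass merges up to mergeFactor partitions
    // into one, repeated until a single sorted output remains.
    static int mergePasses(long partitions, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int passes = 0;
        while (partitions > 1) {
            partitions = (partitions + mergeFactor - 1) / mergeFactor;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        long total = 3_200_000_000L; // the 3.2B OSM points mentioned above
        long budget = 1L << 30;      // assumed 1 GiB heap budget
        int mergeFactor = 40;        // assumed merge fan-in
        for (int bytesPerValue : new int[] {16, 12}) { // assumed before/after cost
            long p = partitions(total, budget, bytesPerValue);
            System.out.println(bytesPerValue + " bytes/value: " + p
                + " partitions, " + mergePasses(p, mergeFactor) + " merge passes");
        }
    }
}
```

Under these assumed numbers the partition count drops from 48 to 36, which
with a fan-in of 40 is the difference between 2 merge passes and 1.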
If I had a way to make Lucene's {{IndexWriter}} postings heap buffer 33% more
RAM efficient, that would be incredible :)
And yes I know {{OfflineSorter}} is also used by non-fixed-length users (e.g.
suggesters, and possibly/probably external users) but I think this new core
usage of it, for numerics and geo, is (suddenly) its most important usage in
Lucene.
[~dawid.weiss], do you disagree so much with the first patch that you would
veto it? If it's OK, I'd rather commit that approach, and open follow-on issues
to improve it later. I prefer that patch, since it adds no new
classes/interfaces, and (like Lucene's doc values) it hides all heap storage
optimizations under the hood. {{OfflineSorter}} is typically IO bound, so I
don't think we should fret about the added conditionals for the CPU.
> BytesRefArray can be more efficient for fixed width values
> ----------------------------------------------------------
>
> Key: LUCENE-7122
> URL: https://issues.apache.org/jira/browse/LUCENE-7122
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master, 6.1
>
> Attachments: LUCENE-7122.patch, LUCENE-7122.patch
>
>
> Today {{BytesRefArray}} uses one int ({{int[]}}, overallocated) per
> value to hold the length, but for dimensional points these values are
> always the same length.
> This can save another 4 bytes of heap per indexed dimensional point,
> which is a big improvement (more points can fit in heap at once) for
> 1D and 2D lat/lon points.
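The idea in the description can be illustrated with a minimal Java sketch.
This is not the {{BytesRefArray}} code or either attached patch; the class
names {{VariableStore}} and {{FixedStore}} are hypothetical, and the sketch
ignores overallocation and int overflow. It only shows why a store that knows
every value has the same width needs no per-value int.

```java
import java.util.Arrays;

// Hypothetical sketch: variable-width storage needs one int of bookkeeping per
// value, while fixed-width storage can compute each position from the index.
final class FixedWidthSketch {

    static final class VariableStore {
        private byte[] bytes = new byte[16];
        private int[] ends = new int[4]; // one int per value: its end offset
        private int count;
        private int used;

        void append(byte[] v) {
            if (used + v.length > bytes.length) {
                bytes = Arrays.copyOf(bytes, Math.max(bytes.length * 2, used + v.length));
            }
            if (count == ends.length) {
                ends = Arrays.copyOf(ends, ends.length * 2);
            }
            System.arraycopy(v, 0, bytes, used, v.length);
            used += v.length;
            ends[count++] = used;
        }

        byte[] get(int i) {
            int start = (i == 0) ? 0 : ends[i - 1];
            return Arrays.copyOfRange(bytes, start, ends[i]);
        }

        // Heap spent on per-value bookkeeping for live entries only.
        long bookkeepingBytes() {
            return (long) count * Integer.BYTES;
        }
    }

    static final class FixedStore {
        private final int width;
        private byte[] bytes;
        private int count;

        FixedStore(int width) {
            if (width <= 0) {
                throw new IllegalArgumentException("width must be positive");
            }
            this.width = width;
            this.bytes = new byte[width * 4];
        }

        void append(byte[] v) {
            if (v.length != width) {
                throw new IllegalArgumentException("expected " + width + " bytes");
            }
            if ((count + 1) * width > bytes.length) {
                bytes = Arrays.copyOf(bytes, bytes.length * 2);
            }
            System.arraycopy(v, 0, bytes, count * width, width);
            count++;
        }

        byte[] get(int i) {
            return Arrays.copyOfRange(bytes, i * width, (i + 1) * width);
        }

        // No per-value bookkeeping: offsets are i * width.
        long bookkeepingBytes() {
            return 0;
        }
    }

    public static void main(String[] args) {
        VariableStore v = new VariableStore();
        FixedStore f = new FixedStore(Integer.BYTES);
        for (int i = 0; i < 1000; i++) {
            byte[] b = {(byte) (i >> 24), (byte) (i >> 16), (byte) (i >> 8), (byte) i};
            v.append(b);
            f.append(b);
        }
        System.out.println("variable bookkeeping bytes: " + v.bookkeepingBytes());
        System.out.println("fixed bookkeeping bytes: " + f.bookkeepingBytes());
    }
}
```

The difference between the two {{bookkeepingBytes()}} results is 4 bytes per
stored value, matching the saving described above.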