[ 
https://issues.apache.org/jira/browse/LUCENE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206494#comment-15206494
 ] 

Michael McCandless commented on LUCENE-7122:
--------------------------------------------

I think people may be underestimating the priority of this issue:

{{OfflineSorter}}, now suddenly heavily used by Lucene's new dimensional 
points, is like a baby Lucene: it pulls values into heap up to its budget, 
then sorts them and writes another segment, and has to merge them all at the 
end.  I have watched it go through its file slowly while indexing 3.2B OSM 
points ;)
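
To make the "baby Lucene" analogy concrete, here is a rough sketch of the buffer, sort, spill, merge loop.  It is not {{OfflineSorter}}'s actual code: the class name {{SpillSortSketch}}, the use of in-memory lists as stand-ins for on-disk segments, and a budget counted in values rather than bytes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene code: sorts integers while buffering
// at most maxBuffered of them at a time.
public class SpillSortSketch {

    /** Buffers up to maxBuffered values, sorts and "spills" each full buffer, then merges. */
    public static List<Integer> sort(List<Integer> input, int maxBuffered) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be >= 1");
        }
        List<List<Integer>> segments = new ArrayList<>(); // stand-ins for on-disk segments
        List<Integer> buffer = new ArrayList<>();
        for (int value : input) {
            buffer.add(value);
            if (buffer.size() == maxBuffered) { // budget reached: sort and spill
                Collections.sort(buffer);
                segments.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) { // spill whatever is left over
            Collections.sort(buffer);
            segments.add(buffer);
        }
        return merge(segments);
    }

    /** K-way merge of sorted, non-empty segments using a priority queue. */
    public static List<Integer> merge(List<List<Integer>> segments) {
        // Each queue entry is {value, segmentIndex, positionInSegment}.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.size(); s++) {
            queue.add(new int[] {segments.get(s).get(0), s, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            merged.add(head[0]);
            List<Integer> segment = segments.get(head[1]);
            int next = head[2] + 1;
            if (next < segment.size()) { // advance within the same segment
                queue.add(new int[] {segment.get(next), head[1], next});
            }
        }
        return merged;
    }
}
```

The point of the sketch is only that a larger buffer means fewer segments, and fewer segments means less merge work at the end.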

This patch would mean we can store 33% more {{IntPoint}}s in heap before 
writing a segment, which is an amazing improvement, especially when it can mean 
0 vs 1 merges needed, or 1 vs 2 merges needed, etc., for many use cases.  No 
matter how fast your SSD is, having to do 1 instead of 2 merges is a big win.
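
As a back-of-the-envelope check, the 33% figure is what you get if each buffered value costs 16 bytes of heap today and 12 bytes once the per-value length int is gone.  The 16-byte starting cost, the 48 MB budget and the 4,000,000-value input below are assumptions chosen for illustration, and {{HeapBudgetSketch}} is a hypothetical name, not anything in Lucene.

```java
// Hypothetical arithmetic sketch, not Lucene code.
public class HeapBudgetSketch {

    /** How many values fit in the heap budget at a given per-value cost. */
    public static long valuesThatFit(long budgetBytes, int bytesPerValue) {
        return budgetBytes / bytesPerValue;
    }

    /** How many sorted segments get written for totalValues (ceiling division). */
    public static long segmentsWritten(long totalValues, long valuesPerSegment) {
        return (totalValues + valuesPerSegment - 1) / valuesPerSegment;
    }

    public static void main(String[] args) {
        long budget = 48L * 1024 * 1024;         // assumed 48 MB heap budget
        long before = valuesThatFit(budget, 16); // assumed cost with a 4-byte length per value
        long after = valuesThatFit(budget, 12);  // same value without the 4-byte length
        long total = 4_000_000L;                 // assumed number of values to sort

        System.out.println("values per segment before: " + before);
        System.out.println("values per segment after:  " + after);
        System.out.println("segments before: " + segmentsWritten(total, before));
        System.out.println("segments after:  " + segmentsWritten(total, after));
    }
}
```

Under those assumed numbers the buffer holds 3,145,728 values before and 4,194,304 after, so the 4,000,000 values need two segments (and a merge) before the change but fit in a single segment after it.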

If I had a way to make Lucene's {{IndexWriter}} postings heap buffer 33% more 
RAM efficient, that would be incredible :)

And yes, I know {{OfflineSorter}} is also used by non-fixed-length users (e.g. 
suggesters, and possibly/probably external users), but I think this new core 
usage for numerics and geo is (suddenly) its most important use within 
Lucene.

[~dawid.weiss], do you disagree so much with the first patch that you would 
veto it?  If it's OK, I'd rather commit that approach, and open follow-on issues 
to improve it later.  I prefer that patch, since it adds no new 
classes/interfaces, and (like Lucene's doc values) it hides all heap storage 
optimizations under the hood.  {{OfflineSorter}} is typically IO bound, so I 
don't think we should fret about the added conditionals for the CPU.

> BytesRefArray can be more efficient for fixed width values
> ----------------------------------------------------------
>
>                 Key: LUCENE-7122
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7122
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master, 6.1
>
>         Attachments: LUCENE-7122.patch, LUCENE-7122.patch
>
>
> Today {{BytesRefArray}} uses one int ({{int[]}}, overallocated) per
> value to hold the length, but for dimensional points these values are
> always the same length. 
> This can save another 4 bytes of heap per indexed dimensional point,
> which is a big improvement (more points can fit in heap at once) for
> 1D and 2D lat/lon points.
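
For readers who want to see the idea in the description above, here is a minimal sketch of fixed-width storage: when every value has the same length, the start of value i is simply i * width, so no per-value int is needed.  This is not the attached patch and not {{BytesRefArray}}'s API; the class {{FixedWidthBytesSketch}} and its methods are hypothetical, and it ignores things a real implementation needs, such as paged blocks, RAM accounting and int overflow on very large arrays.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: append-only storage for
// values that all have the same byte length.
public class FixedWidthBytesSketch {
    private final int width; // length shared by every value
    private byte[] data;     // all values packed back to back
    private int count;       // number of values appended so far

    public FixedWidthBytesSketch(int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be >= 1");
        }
        this.width = width;
        this.data = new byte[width * 16];
    }

    /** Appends one value and returns its index. */
    public int append(byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("expected " + width + " bytes, got " + value.length);
        }
        int start = count * width; // offset is computed, not stored
        if (start + width > data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow by doubling
        }
        System.arraycopy(value, 0, data, start, width);
        return count++;
    }

    /** Returns a copy of the value at the given index. */
    public byte[] get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + count);
        }
        return Arrays.copyOfRange(data, index * width, (index + 1) * width);
    }

    public int size() {
        return count;
    }
}
```

The only per-value heap cost here is the value's own bytes, which is where the 4-byte saving per point would come from.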



