[ 
https://issues.apache.org/jira/browse/LUCENE-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389815#comment-15389815
 ] 

Michael McCandless commented on LUCENE-7390:
--------------------------------------------

bq. I have a little concern about this being fairly sizeable amount of ram

Yeah I agree...

But, with this change, we allow each flushing segment to use up to 1/8th of 
IW's buffer, or 16 MB, whichever is larger, in temp space.  Remember that this 
is transient usage: once the sort finishes and the points are written, it's 
freed.  It's not unlike how merging uses temp space to map around deleted doc 
IDs, or how in-flight flushing segments tie up temp space until they finish 
writing.  I 
think IW has a right to use temp space beyond the "long term" indexing buffer 
... I'll try to improve IWC's javadocs here, explaining that this is not a hard 
limit.
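
To make the arithmetic concrete, here is a minimal sketch of the rule described 
above (1/8th of IW's buffer, or 16 MB, whichever is larger).  The class and 
method names are hypothetical, made up for illustration; this is not the code 
in the patch.

```java
// Hypothetical helper, not part of Lucene: illustrates the
// "1/8th of IW's buffer, or 16 MB, whichever is larger" rule.
class TempHeapBudget {

    // Floor for the per-flush sorting heap, in MB (the previous fixed limit).
    static final double MIN_SORT_HEAP_MB = 16.0;

    // Returns the transient heap, in MB, one flushing segment may use to sort points.
    static double sortHeapMB(double iwBufferMB) {
        return Math.max(MIN_SORT_HEAP_MB, iwBufferMB / 8.0);
    }

    public static void main(String[] args) {
        System.out.println(sortHeapMB(1024.0)); // prints 128.0 (1 GB buffer)
        System.out.println(sortHeapMB(64.0));   // prints 16.0 (the floor applies)
    }
}
```

So with the 1 GB buffer from the issue description, each flushing segment could 
use up to 128 MB of temp heap instead of 16 MB.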

bq. It is a little annoying that performance is so sensitive to this change, we 
should look into that more somehow. Maybe we can improve it so it does not need 
so much RAM.

I already made quite a few optimizations here, but I agree we could do more, 
e.g. don't always do a secret {{forceMerge}} in {{OfflineSorter}} 
(LUCENE-7141), but that got sort of complicated when I last tried...

I think the discontinuity, moving from a single in-heap sort, to "serialize to 
disk", "read 2 partitions and sort those in heap", "write those partitions to 
disk", "do a final merge sort of those 2 partitions to another file", is the 
big hit, and I agree it would be great to find a way to reduce that cost.
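
The extra steps on the offline path can be sketched roughly as below.  This is a 
toy illustration only: it sorts plain {{long}} values, all names are 
hypothetical, and it is not how {{OfflineSorter}} or {{BKDWriter}} are actually 
implemented.  It also assumes each of the 2 partitions fits in heap.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Toy model of the two paths: one in-heap sort vs. serialize, sort 2 partitions, merge.
class SortPathSketch {

    static long[] sort(long[] values, int maxInHeap) throws IOException {
        if (values.length <= maxInHeap) {
            long[] copy = values.clone();   // cheap path: a single in-heap sort
            Arrays.sort(copy);
            return copy;
        }
        Path unsorted = Files.createTempFile("points", ".unsorted");
        Path p0 = Files.createTempFile("points", ".p0");
        Path p1 = Files.createTempFile("points", ".p1");
        Path merged = Files.createTempFile("points", ".merged");
        try {
            write(unsorted, values);                               // serialize to disk
            int half = values.length / 2;
            write(p0, sorted(read(unsorted, 0, half)));            // read + sort + write partition 0
            write(p1, sorted(read(unsorted, half, values.length - half))); // same for partition 1
            merge(p0, half, p1, values.length - half, merged);     // final merge sort to another file
            return read(merged, 0, values.length);                 // read back only for this demo
        } finally {
            for (Path p : new Path[] {unsorted, p0, p1, merged}) {
                Files.deleteIfExists(p);
            }
        }
    }

    static long[] sorted(long[] a) {
        Arrays.sort(a);
        return a;
    }

    static DataInputStream open(Path path) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(path)));
    }

    static DataOutputStream create(Path path) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(path)));
    }

    static void write(Path path, long[] values) throws IOException {
        try (DataOutputStream out = create(path)) {
            for (long v : values) out.writeLong(v);
        }
    }

    // Reads count longs from path after skipping the first skip longs.
    static long[] read(Path path, int skip, int count) throws IOException {
        try (DataInputStream in = open(path)) {
            for (int i = 0; i < skip; i++) in.readLong();
            long[] result = new long[count];
            for (int i = 0; i < count; i++) result[i] = in.readLong();
            return result;
        }
    }

    // Streams two sorted files into dest, holding only one value per input in heap.
    static void merge(Path a, int countA, Path b, int countB, Path dest) throws IOException {
        try (DataInputStream inA = open(a); DataInputStream inB = open(b);
             DataOutputStream out = create(dest)) {
            int i = 0, j = 0;
            long nextA = countA > 0 ? inA.readLong() : 0;
            long nextB = countB > 0 ? inB.readLong() : 0;
            while (i < countA || j < countB) {
                boolean takeA = j >= countB || (i < countA && nextA <= nextB);
                if (takeA) {
                    out.writeLong(nextA);
                    if (++i < countA) nextA = inA.readLong();
                } else {
                    out.writeLong(nextB);
                    if (++j < countB) nextB = inB.readLong();
                }
            }
        }
    }
}
```

Even in this toy version, crossing the threshold turns one sort into three full 
write passes and several read passes over the data, which is the discontinuity 
I mean.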

> Let BKDWriter use temp heap for sorting points in proportion to IndexWriter's 
> indexing buffer
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7390.patch
>
>
> With Lucene's default codec, when writing dimensional points, we only give 
> {{BKDWriter}} 16 MB heap to use for sorting, regardless of how large IW's 
> indexing buffer is.  A custom codec can change this but that's a little steep.
> I've been testing indexing performance on a points-heavy dataset, 1.2 billion 
> taxi rides from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml 
> , indexing with a 1 GB IW buffer, and the small 16 MB heap limit causes clear 
> performance problems because flushing the large segments forces {{BKDWriter}} 
> to switch to offline sorting, which causes the DWPTs to take too long to flush.  
> They then fall behind, and Lucene does a hard stall on incoming indexing 
> threads until they catch up.
> [~rcmuir] had a simple idea to let IW pass the allowed temp heap usage to 
> {{PointsWriter.writeField}}.



