Michael McCandless commented on LUCENE-8635:

OK thanks [~sokolov].  I'll try to also run bench on wikibig and report back.  
I think doing a single method call instead of the two (seek + read) via 
{{RandomAccessInput}} must be helping.
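To illustrate why the single call can help, here is a toy sketch. The two interfaces below are simplified stand-ins modeled loosely on Lucene's {{IndexInput}} (stateful: seek, then read) and {{RandomAccessInput}} (one positional read), not the real API:

```java
import java.nio.ByteBuffer;

public class PositionalReadSketch {
    // Simplified stand-ins, NOT the real Lucene interfaces:
    // a stateful input needs seek() followed by readByte(),
    // a positional input reads at an absolute offset in one call.
    interface StatefulInput { void seek(long pos); byte readByte(); }
    interface PositionalInput { byte readByte(long pos); }

    // Two virtual calls through the stateful API.
    static byte readStateful(StatefulInput in, long pos) {
        in.seek(pos);          // call 1: move the file pointer
        return in.readByte();  // call 2: read at the pointer
    }

    // One virtual call through the positional API.
    static byte readPositional(PositionalInput in, long pos) {
        return in.readByte(pos);
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
        StatefulInput stateful = new StatefulInput() {
            private int pos;
            public void seek(long p) { pos = (int) p; }
            public byte readByte() { return data.get(pos++); }
        };
        PositionalInput positional = pos -> data.get((int) pos);
        // Both read the byte at offset 2; prints "30 30".
        System.out.println(readStateful(stateful, 2) + " " + readPositional(positional, 2));
    }
}
```

Besides one fewer method call per access, the positional style carries no mutable position state, which matters for the very random access pattern of the terms index.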
{quote}The thing that makes me want to be careful here is that access to the 
terms index is very random, so things might degrade badly if the OS cache 
doesn't hold the whole terms index in memory.
{quote}
I think net/net we are already relying on the OS to do the right thing here.  
As things stand today, the OS could also swap out the heap pages that hold the 
FST's {{byte[]}}, depending on its swappiness (on Linux).

{quote}I'm not super familiar with the FST internals, I wonder whether there 
are changes that we could make to it so that it would be more disk-friendly, 
eg. by seeking backward as little as possible when looking up a key?
{quote}
We used to have a {{pack}} method in FST that would 1) try to further shrink 
the {{byte[]}} by moving nodes "closer" to the nodes that transitioned to 
them, and 2) reverse the bytes.  But we removed that method because it added 
complexity, nobody was really using it, and sometimes it even made the FST 
bigger!

Maybe we could bring that method back, but only part 2) of it, and always call 
it at the end of building an FST?  That would be simpler code (without part 
1), and would give sequential reads of at least the bytes needed to decode a 
single transition; maybe it yields a performance jump independent of this 
change.  But I think we really should explore that separately from this issue 
... as long as additional performance tests show only these smallish impacts 
to real queries, I think we should just make the change across the board for 
the terms dictionary index.
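To make the part-2 idea concrete: FST bytes are written back-to-front, so lookups walk the {{byte[]}} backward; a one-time reversal after build would let each transition decode as a forward, sequential read. A toy sketch of that post-build pass (plain Java, not the removed {{pack}} code):

```java
import java.util.Arrays;

public class ReverseFstBytes {
    // Hypothetical post-build step: the builder emits nodes back-to-front,
    // so a lookup reads the byte[] backward. Reversing the buffer once,
    // after the FST is frozen, turns those backward scans into forward,
    // disk-friendly sequential reads (offsets must be remapped accordingly:
    // old offset i becomes length - 1 - i).
    static byte[] reverse(byte[] src) {
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[src.length - 1 - i];
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] fstBytes = {5, 4, 3, 2, 1}; // toy stand-in for frozen FST bytes
        System.out.println(Arrays.toString(reverse(fstBytes))); // [1, 2, 3, 4, 5]
    }
}
```

This is only the reversal itself; the real work would be fixing up every stored node address to point into the reversed buffer, which is where the complexity of the old {{pack}} method lived.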

> Lazy loading Lucene FST offheap using mmap
> ------------------------------------------
>                 Key: LUCENE-8635
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8635
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/FSTs
>         Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>            Reporter: Ankit Jain
>            Priority: Major
>         Attachments: offheap.patch, ra.patch, rally_benchmark.xlsx
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the terms dictionary gets big. A better 
> way of doing this would be to lazily load the FST using mmap, so that only 
> the required terms get loaded into memory.
> Lucene can expose an API for providing the list of fields whose terms should 
> be loaded offheap. I'm planning to take the following approach:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
