On Fri, Jan 24, 2014 at 4:08 PM, Nikolas Everett <[email protected]> wrote:

> I just happened upon something that needs the sort parameter, and I
> figured I should look it up and I saw that the field is loaded into
> memory.  My concern was that it'd be possible (but useless) to construct a
> query that matches all documents and then ask Elasticsearch to sort them
> all.  In effect, pulling that particular field into memory.  So my question
> was, is there a way to limit the number of documents that need that field
> pulled into memory?
>
> Suppose I have a million documents per shard and the field I'm sorting on
> takes an average of a hundred bytes, that means I'm having to slurp 100M of
> stuff into memory.  That isn't quick and consumes 1/300th of the heap on
> the node just for one shard.  In my case I'd prefer to just sort the first
> ten thousand documents and warn them that the sorting wasn't wholly
> accurate.  I suppose I could execute a count and if the count comes back
> too high then refuse to do the search at all but that seems less pleasant.
> I suppose I have the same feeling about faceting as well.  And, yeah, I'm
> not being clear about what "the first" really means because I haven't
> really thought that part through.
>

OK, I see what you mean. For reference, if the field is 100 bytes and there
are 1M documents, this will not necessarily require 100M, as values are
deduplicated in field data. Additionally, there is an FST-based field data
implementation that can help save even more memory when values share
prefixes and/or suffixes.
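
To make the deduplication point concrete, here is a rough back-of-the-envelope
sketch in Python. It is not how Elasticsearch lays out field data: it simply
assumes each distinct value is stored once and each document keeps a
fixed-size ordinal pointing at its value. The 4-byte ordinal and the sample
data are assumptions chosen for illustration.

```python
def naive_bytes(values):
    """Bytes needed if every document stored its own copy of the value."""
    return sum(len(v.encode("utf-8")) for v in values)


def estimate_field_data_bytes(values, ordinal_bytes=4):
    """Rough estimate with deduplication.

    Assumption: each distinct value is stored once, and every document
    holds one fixed-size ordinal (ordinal_bytes) referring to its value.
    """
    unique = set(values)
    term_bytes = sum(len(v.encode("utf-8")) for v in unique)
    return term_bytes + ordinal_bytes * len(values)


# Hypothetical shard: 10,000 documents, 100-byte values, 100 distinct values.
values = [f"{i % 100:0100d}" for i in range(10_000)]

print(naive_bytes(values))                # one copy per document
print(estimate_field_data_bytes(values))  # distinct values plus ordinals
```

The saving depends entirely on how many distinct values there are. If every
document has a unique value, the deduplicated estimate is larger than the
naive one because of the extra ordinals.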

I think your idea is doable: we could trade accuracy for memory in the case
of sorting. We actually have something similar for faceting: it is possible
to only load into memory values that have high-enough frequencies. I guess
there could be something similar for sorting, by only loading into memory
the first n bytes of the field that is used for sorting? (Potentially with
a few variants, e.g. to be able to load the exact terms for the least ones
for better accuracy.)
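
A minimal sketch of the "first n bytes" idea, in Python. This is only an
illustration of the trade-off, not an existing Elasticsearch feature; the
function names and sample documents are made up. Documents whose values share
the same n-byte prefix compare as equal, so their relative order is not
guaranteed to match the exact sort.

```python
def sort_by_prefix(docs, n):
    """Sort (doc_id, value) pairs using only the first n bytes of each value.

    Ties between equal prefixes keep their input order (Python's sort is
    stable), which is where the loss of accuracy comes from.
    """
    keyed = [(value.encode("utf-8")[:n], doc_id) for doc_id, value in docs]
    keyed.sort(key=lambda pair: pair[0])
    return [doc_id for _, doc_id in keyed]


def prefix_bytes(docs, n):
    """Bytes held in memory when only n-byte prefixes are loaded."""
    return sum(min(len(value.encode("utf-8")), n) for _, value in docs)


# Hypothetical documents: (doc_id, value of the sort field).
docs = [(1, "elasticsearch"), (2, "elastic"), (3, "lucene"), (4, "elasticity")]

exact = [doc_id for doc_id, _ in sorted(docs, key=lambda d: d[1].encode("utf-8"))]
print(exact)                    # exact order on full values
print(sort_by_prefix(docs, 4))  # short prefix: cheap but approximate
print(sort_by_prefix(docs, 8))  # longer prefix: more memory, closer to exact
```

With a 4-byte prefix, the three values starting with "elas" are
indistinguishable, so they come back in input order. With an 8-byte prefix,
the order happens to match the exact sort for this data.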


> I did poke around the implementation and I saw that it loads the terms
> into memory for each segment.  I didn't see where it unpins the loaded
> terms, though. Does it unpin them when it is done with the segment?
>

If you look into IndexFieldDataCache, there is a call to
SegmentReaderUtils.registerCoreListener. This will cause field data to be
unloaded when the segment is closed.
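
The pattern looks roughly like the following Python sketch. It is a simplified
stand-in, not the actual IndexFieldDataCache code: the Segment and
FieldDataCache classes and their methods are hypothetical names used only to
show how a close listener can evict cached entries.

```python
class Segment:
    """Stand-in for a segment reader that notifies listeners when it closes."""

    def __init__(self, name):
        self.name = name
        self._close_listeners = []

    def add_close_listener(self, listener):
        self._close_listeners.append(listener)

    def close(self):
        for listener in self._close_listeners:
            listener(self)


class FieldDataCache:
    """Caches loaded field data per (segment, field), evicting on segment close."""

    def __init__(self, loader):
        self._loader = loader   # callable (segment, field) -> loaded values
        self._entries = {}      # (segment name, field) -> loaded values
        self._watched = set()   # segment names that already have a listener

    def load(self, segment, field):
        key = (segment.name, field)
        if key not in self._entries:
            # Register the eviction callback once per segment.
            if segment.name not in self._watched:
                segment.add_close_listener(self._on_segment_closed)
                self._watched.add(segment.name)
            self._entries[key] = self._loader(segment, field)
        return self._entries[key]

    def _on_segment_closed(self, segment):
        # Drop every field loaded for the closed segment.
        for key in [k for k in self._entries if k[0] == segment.name]:
            del self._entries[key]
        self._watched.discard(segment.name)

    def __len__(self):
        return len(self._entries)
```

In this sketch, nothing is unloaded when a single search finishes. Entries
stay cached until the segment itself is closed, at which point its listener
removes them.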

-- 
Adrien Grand
