On Fri, Jan 24, 2014 at 4:08 PM, Nikolas Everett <[email protected]> wrote:
> I just happened upon something that needs the sort parameter and I
> figured I should look it up and I saw that the field is loaded into
> memory. My concern was that it'd be possible (but useless) to construct a
> query that matches all documents and then ask Elasticsearch to sort them
> all. In effect, pulling that particular field into memory. So my question
> was, is there a way to limit the number of documents that need that field
> pulled into memory?
>
> Suppose I have a million documents per shard and the field I'm sorting on
> takes an average of a hundred bytes, that means I'm having to slurp 100M of
> stuff into memory. That isn't quick and consumes 1/300th of the heap on
> the node just for one shard. In my case I'd prefer to just sort the first
> ten thousand documents and warn them that the sorting wasn't wholly
> accurate. I suppose I could execute a count and if the count comes back
> too high then refuse to do the search at all but that seems less pleasant.
> I suppose I have the same feeling about faceting as well. And, yeah, I'm
> not being clear about what "the first" really means because I haven't
> really thought that part through.
>

OK, I see what you mean. For reference, if the field is 100 bytes and there
are 1M documents, this will not necessarily require 100M, as values are
deduplicated in field data. Additionally, there is an FST-based field data
implementation that can help save even more memory when values share
prefixes and/or suffixes.

I think your idea is doable: we could trade accuracy for memory in the case
of sorting. We actually have something similar for faceting: it is possible
to load into memory only the values that have high-enough frequencies. I
guess there could be something similar for sorting, by loading into memory
only the first n bytes of the field that is used for sorting? (Potentially
with a few variants, e.g. to be able to load the exact terms for the
smallest values for better accuracy.)

> I did poke around the implementation and I saw that it loads the terms
> into memory for each segment. I didn't see where it unpins the loaded
> terms, though. Does it unpin them when it is done with the segment?
>

If you look into IndexFieldDataCache, there is a call to
SegmentReaderUtils.registerCoreListener. This will cause field data to be
unloaded when the segment is closed.

-- 
Adrien Grand
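
To make the two memory points above concrete, here is a minimal Python sketch. It is not how Elasticsearch implements field data: the function names (raw_size, dedup_size, sort_by_prefix) and the sample values are hypothetical, and the size estimate ignores the per-document ordinals that a real implementation also has to store. It only shows that deduplication shrinks the bytes held for the values, and that sorting on the first n bytes can give a slightly wrong order when n is too small.

```python
def raw_size(values):
    """Bytes needed if every document keeps its own copy of the value."""
    return sum(len(v) for v in values)


def dedup_size(values):
    """Bytes needed if each distinct value is stored only once.

    Assumption: per-document ordinals pointing at the values are ignored.
    """
    return sum(len(v) for v in set(values))


def sort_by_prefix(values, n):
    """Sort using only the first n bytes of each value as the sort key.

    Python's sort is stable, so values whose prefixes are equal keep
    their original relative order, which may differ from the exact order.
    """
    return sorted(values, key=lambda v: v[:n])


# Hypothetical field values, one per document.
values = [b"elasticsearch", b"elastic", b"lucene", b"elasticsearch", b"lucid"]

if __name__ == "__main__":
    print(raw_size(values), dedup_size(values))
    print(sort_by_prefix(values, 3))   # approximate order
    print(sort_by_prefix(values, 8))   # matches the exact order here
    print(sorted(values))              # exact order
```

With these sample values a 3-byte prefix leaves "elasticsearch" ahead of "elastic", while an 8-byte prefix already reproduces the exact order, which is the accuracy-for-memory trade-off described above.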
