Sorry for the delay between my Twitter response and my reply here.

Basically, sorting first and then performing query/filter matches is not 
really a tenable solution, due to memory constraints.  If you were to sort 
first, you would need to sort the documents (which may be very expensive 
over, say, 5bn docs), and then maintain that sorted order in memory so you 
can perform the next query.  The memory overhead is the real reason why it 
won't work: maintaining that sort in memory is just not acceptable, 
especially if you consider fifty or a hundred concurrent search requests 
all trying to maintain the sort in memory.

It would just fall apart because there is no way you can guarantee enough 
memory to satisfy the operation.  With the current arrangement, the query 
latency may increase as load increases, but you won't OOM when the number 
of queries hits a critical point.
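
To put rough numbers on the memory argument, here is a back-of-the-envelope 
sketch in Python.  The 5bn docs and 100 concurrent requests are the figures 
used above; the 4 bytes per entry (one 32-bit doc ID per slot of the sorted 
order) and the idea that each request holds its own copy are assumptions 
made purely for illustration, not measurements.

```python
# Rough estimate of the memory needed to hold a full sorted order in RAM.
# All constants are illustrative assumptions, not measured values.

DOCS = 5_000_000_000        # "say 5bn docs", from the discussion above
BYTES_PER_ENTRY = 4         # assumption: one 32-bit doc ID per sorted slot
CONCURRENT_REQUESTS = 100   # "fifty or a hundred concurrent search requests"


def sorted_order_bytes(docs: int, bytes_per_entry: int, requests: int) -> int:
    """Bytes needed if every request keeps its own copy of the sorted order."""
    return docs * bytes_per_entry * requests


per_request = sorted_order_bytes(DOCS, BYTES_PER_ENTRY, 1)
total = sorted_order_bytes(DOCS, BYTES_PER_ENTRY, CONCURRENT_REQUESTS)

print(f"{per_request / 1e9:.0f} GB per request")
print(f"{total / 1e12:.0f} TB for {CONCURRENT_REQUESTS} concurrent requests")
```

Even with these conservative assumptions, the total is far beyond what you 
could guarantee to have available.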

The way Elasticsearch executes queries is basically like this:


   1. Filters are executed and "mask" the index.  Only documents that match 
   the set of filters will be evaluated by the query.  Filter evaluation is 
   extremely fast, much faster than performing a sort.  Once the filter is 
   cached, it is basically bitwise operations.
   2. The query evaluates the documents that match the filter and generates 
   a score for each one.
   3. This score is offered to a priority queue of size "from" + "size".  
   If you request "from:0" and "size:10", each shard maintains a priority 
   queue of size 10.  When a document is offered to the priority queue, the 
   PQ checks whether its score is greater than the least value in the queue.  
   If it is, the value is inserted and the least value is evicted.  PQs 
   guarantee the top N results based on the score.  So you can see that ES 
   isn't really "sorting" the results; it is just generating a score and 
   seeing if it is in the top N results.  This is why it can scale to 
   billions of docs.
   4. Since you are sorting by time, the "score" compared for each document 
   is basically its timestamp.
   5. These per-shard PQs are merged on the coordinating node.
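
The steps above can be sketched in a few lines of Python.  This is only an 
illustration of the bounded priority queue idea, not how Elasticsearch or 
Lucene actually implement it.  The document layout (doc_id, timestamp, 
text), the function names and the sample data are all hypothetical, and 
"newest first" is an assumed sort direction.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Doc = Tuple[int, int, str]    # hypothetical layout: (doc_id, timestamp, text)
Entry = Tuple[int, int]       # (timestamp, doc_id); the timestamp is the "score"


def shard_top_n(docs: Iterable[Doc], matches: Callable[[str], bool],
                n: int) -> List[Entry]:
    """Return the n newest matching docs on one shard using a bounded min-heap."""
    heap: List[Entry] = []    # heap[0] is always the least value in the queue
    for doc_id, timestamp, text in docs:
        if not matches(text):             # step 1: the filter masks the index
            continue
        entry = (timestamp, doc_id)       # steps 2 and 4: score is the timestamp
        if len(heap) < n:
            heapq.heappush(heap, entry)   # queue not full yet, always insert
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # step 3: evict the least value
    return sorted(heap, reverse=True)     # only n entries are ever sorted


def coordinate(per_shard: Iterable[List[Entry]], n: int) -> List[Entry]:
    """Step 5: merge the per-shard queues and keep the global top n."""
    return heapq.nlargest(n, (e for shard in per_shard for e in shard))


# Hypothetical sample data for two shards.
shard_a = [(1, 100, "alice met bob"), (2, 300, "bob alone"),
           (3, 200, "alice again"), (4, 400, "alice later")]
shard_b = [(5, 250, "alice elsewhere"), (6, 500, "carol"),
           (7, 50, "alice early")]


def has_alice(text: str) -> bool:
    return "alice" in text


top_a = shard_top_n(shard_a, has_alice, 2)
top_b = shard_top_n(shard_b, has_alice, 2)
print(coordinate([top_a, top_b], 2))
```

Note that memory per request is bounded by n entries per shard, no matter 
how many documents match the filter.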

Could you post your query?  We may be able to help with optimizations, or 
suggest alternatives to speed it up like rescoring.  What query latency are 
you seeing, and what would you like it to be?  What does your system load 
and cluster look like?

As to your question about...we are investigating ways to change how data is 
stored in segments.  Currently the storage order is effectively random, 
because this is the most performant way to merge segments (since you don't 
need to care about order).  An alternative is to merge segments in some 
order, such as timestamp.  This would considerably slow down merging, but 
would speed up operations like time-series analysis.  We're looking into 
it, but nothing firm yet.

-Z


On Wednesday, March 19, 2014 6:45:43 AM UTC-5, David Pfeffer wrote:
>
>  I have an index that contains 30 GB worth of news stories. I want to 
> return the stories that contain a particular name in their text, sorted 
> chronologically. I only want the first 100 stories.
>
> ElasticSearch seems to approach this problem by filtering every story to 
> just those that match, then sorting those results and returning the top 
> 100. This uses a reasonably large amount of resources to filter every 
> single one.
>
> Can I get ElasticSearch to instead sort first, and then filter in order 
> until it reaches the maximum (100). Granted that this would be 100 per 
> shard, but then the final step would be to take each shard's 100, sort them 
> all together, and take the top 100 of that result set. This should, at 
> least in my mind, use significantly less resources, as it would only need 
> to go through maybe 5000 or 10000 items to find a match, as opposed to the 
> entirety of the index.
>
> *(Cross-posted from 
> http://stackoverflow.com/questions/22467585/sort-before-filters-in-elasticsearch, 
> because I didn't get an answer there for 2 days.)*
>
