Hi,
  I am facing the same situation:
We would like to get all the ids of the documents matching certain 
criteria. In the worst case (the one I describe here), around 200K 
documents match, and in our first tests retrieving them is really slow 
(around 15 seconds). However, if we run the same query just to count the 
documents, ES replies in 10-15ms, which is amazing.
I suspect the problem is in the transport layer and the latency of 
transferring a big JSON result.

Would you recommend, in a situation like this, using another transport 
layer such as Thrift, or a custom solution?

Thanks in advance
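
One thing worth trying before switching the transport layer: ask ES to 
return only the document ids, so each hit carries almost no payload. A 
minimal sketch of the kind of request I mean, in Python (the index name 
and criteria are hypothetical, and I haven't benchmarked this on your data):

```python
import json

# Hypothetical index and filter criteria -- substitute your own.
INDEX = "my_index"
criteria = {"term": {"status": "active"}}

# Disable _source so each hit comes back as little more than its _id,
# which shrinks the JSON that has to cross the wire.
scan_body = {
    "query": criteria,
    "_source": False,
}

# search_type=scan skips scoring and sorting; scroll keeps the cursor open,
# i.e. POST /my_index/_search?search_type=scan&scroll=1m&size=1000
scan_params = {"search_type": "scan", "scroll": "1m", "size": 1000}

print(json.dumps(scan_body))
print(scan_params)
```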

On Thursday, December 11, 2014 at 14:00:05 UTC+1, Ron Sher wrote:
>
> Just tested this.
> When I used a large number to get all of my documents according to some 
> criteria (4926 in the result) I got:
> 13.951s when using a size of 1M
> 43.6s when using scan/scroll (with a size of 100)
>
> Looks like I should be using the not-recommended approach.
> Can I make the scroll perform better?
>
> Thanks,
> Ron
>
> On Wednesday, December 10, 2014 10:53:50 PM UTC+2, David Pilato wrote:
>>
>> No I did not say that. Or I did not mean that. Sorry if it was unclear.
>> I said: don’t use large sizes:
>>
>> Never use size:10000000 or from:10000000. 
>>>
>>
>> You should read this: 
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
>>
>> -- 
>> *David Pilato* | *Technical Advocate* | *Elasticsearch.com 
>> <http://Elasticsearch.com>*
>> @dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr 
>> <https://twitter.com/elasticsearchfr> | @scrutmydocs 
>> <https://twitter.com/scrutmydocs>
>>
>>
>>  
>> On December 10, 2014 at 21:16, Ron Sher <[email protected]> wrote:
>>
>> So you're saying there's no impact on Elasticsearch if I issue a large 
>> size? 
>> If that's the case, then why shouldn't I just use a size of 1M if I want 
>> to make sure I get everything?
>>
>> On Wednesday, December 10, 2014 8:22:47 PM UTC+2, David Pilato wrote:
>>>
>>> Scan/scroll is the best option to extract a huge amount of data.
>>> Never use size:10000000 or from:10000000. 
>>>
>>> It's not realtime because you basically scroll over a fixed set of 
>>> segments: changes that arrive in new segments won't be taken into 
>>> account during the scroll.
>>> That is actually good, because you won't get inconsistent results.
>>>
>>> About size, I would try and test; it depends on your document size, I 
>>> believe.
>>> Try with 10000 and see how it goes as you increase it. You may discover 
>>> that getting 10*10000 docs takes about the same time as 1*100000. :)
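
The trade-off David describes comes down to round-trips: the number of 
scroll requests is the total hit count divided by the per-request size, so 
a bigger size means fewer network round-trips. A quick sketch of that 
arithmetic (my own illustration, not an ES API call):

```python
import math

def scroll_round_trips(total_hits, page_size):
    """How many scroll requests are needed to drain total_hits results."""
    return math.ceil(total_hits / page_size)

# Ron's test above: 4926 hits with a scroll size of 100 means ~50
# round-trips, which is where much of the 43.6s likely went.
print(scroll_round_trips(4926, 100))     # 50
print(scroll_round_trips(100000, 10000)) # 10
```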
>>>
>>> Best
>>>
>>> David
>>>
>>> On December 10, 2014 at 19:09, Ron Sher <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I was wondering about best practices for getting all data matching 
>>> some filters.
>>> The options as I see them are:
>>>
>>>    - Use a very big size, i.e. some value like 1M, to make sure I get 
>>>    everything back (even if I only need a few hundred or a few dozen 
>>>    documents). This is the quickest way, development-wise.
>>>    - Use paging with size and from. This requires looping over the 
>>>    results, and performance gets worse as we advance to later pages. 
>>>    Also, we need to use preference if we want consistent results across 
>>>    pages, and it's not clear what the recommended page size is.
>>>    - Use scan/scroll. This gives consistent paging but also has several 
>>>    drawbacks: with search_type=scan the results can't be sorted; 
>>>    scan/scroll may be less performant than paging (the documentation 
>>>    says it's not for realtime use); and again it's not clear which size 
>>>    is recommended.
>>>
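
The from/size option in that list can be sketched as follows: `from` grows 
linearly with the page number, which is why deep pages get slow (each shard 
has to collect, sort, and then discard the first `from` hits for every 
page). The query and page size here are hypothetical:

```python
def page_request(query, page, size=100):
    """Build the body for the Nth page (0-based) of a from/size search."""
    return {"query": query, "from": page * size, "size": size}

q = {"match_all": {}}  # stand-in for the real filter
print(page_request(q, 0))   # first page: from=0
print(page_request(q, 50))  # deep page: from=5000, so every shard must
                            # sort 5000 + 100 hits just to serve this page
```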
>>> So you see: many options, and it's not clear which path to take.
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Ron
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/764a37c5-1fec-48c4-9c66-7835d8141713%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/elasticsearch/764a37c5-1fec-48c4-9c66-7835d8141713%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>>

