The issue with this workload is that it is very random-I/O-intensive, so I'm
afraid it might behave badly once your index grows larger than the size of
your filesystem cache. (This issue is not specific to Elasticsearch; any
data store would suffer from it when trying to fetch large numbers of
random records.)

That said, if your index is small and/or you have lots of RAM for your
filesystem cache, this might still work well enough.

Regarding your question about sizes: Elasticsearch does roughly 1 or 2
random seeks per search term (in the inverted index) and 1 per returned
document. Since your sizes are large, running 20 queries with size=10K
versus 1 with size=200K doesn't change much with regard to disk seeks,
which are dominated by the seeks needed to return documents.

However, memory-wise, Elasticsearch is going to be much happier if you run
more search requests with smaller sizes, so I would recommend running 20
queries with a size of 10K (or maybe even 200 with size=1K).
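To make the 20-request pattern concrete, here is a minimal sketch of how the
request bodies could be built. The index name "tweets" and the seed values are
made up for illustration; the approach uses a function_score query with
random_score (the technique from the Stack Overflow thread below), which gives
each document a pseudo-random score derived from the seed, so sorting by score
yields a random ordering:

```python
# Hypothetical values for illustration only.
INDEX = "tweets"
NUM_REQUESTS = 20
PAGE_SIZE = 10_000  # 20 x 10K rather than a single 200K request


def random_sample_query(seed, size):
    """Build a search body that returns `size` documents in random order.

    random_score scores each document pseudo-randomly from `seed`;
    using a different seed per request gives a different ordering,
    at the cost of possible duplicates across requests.
    """
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "random_score": {"seed": seed},
            }
        },
    }


# One body per request; each would be POSTed to /tweets/_search.
bodies = [random_sample_query(seed, PAGE_SIZE) for seed in range(NUM_REQUESTS)]
print(len(bodies), bodies[0]["size"])
```

Each request stays small enough to be easy on the heap, and the per-request
seed is what keeps successive pages from returning in the same order.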


On Wed, Feb 19, 2014 at 9:56 PM, Josh Harrison <[email protected]> wrote:

> Darn ok. Thank you.
> If I'm retrieving large numbers of random largish (twitter river records)
> documents, is there a particular pattern I should use for searching? That
> is, does it make sense to send 20 sequential queries with size 10,000 and
> random sorting, or a single query with a size of 200,000? What about up
> into the millions? Obviously we're risking duplication of results when
> sending multiple smaller queries, but this is OK for our purposes, or can
> be dealt with at another stage of the process outside ES.
> Thanks,
> Josh
>
>
> On Wednesday, February 19, 2014 12:41:58 PM UTC-8, Adrien Grand wrote:
>
>> Hi Josh,
>>
>> In order to run efficiently, scan queries read records sequentially on
>> disk and keep a cursor that is used to maintain state between successive
>> pages. It would not be possible to get records in a random order as it
>> would not be possible to read sequentially anymore.
>>
>>
>> On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison <[email protected]> wrote:
>>
>>> I need to be able to pull 100s of thousands to millions of random
>>> documents from my indexes. Normally, to pull data this large I'd do a scan
>>> query, but they don't support sorting, so the suggestions I've seen online
>>> for randomizing your results don't work (such as those discussed here:
>>> http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch).
>>> Is there a way to introduce randomness into a basic scan query?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/elasticsearch/b3971dda-2963-48ce-b7ed-f50e85b82a97%
>>> 40googlegroups.com.
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>>
>> --
>> Adrien Grand
>>
>



-- 
Adrien Grand
