Further to (2): would it be an improvement to have a different kind of request for a scrolling search? That way the API could exclude options that don't make sense for scrolling (e.g. aggregations, facets, etc.).
On Wednesday, 18 June 2014 10:28:06 UTC+1, mooky wrote:
>
> Many thanks, Jörg. Further questions/comments inline:
>
>> 1. yes
>
> Thanks.
>
>> 2. facet/aggregations are not very useful while scrolling (I doubt they
>> even work at all) because scrolling works on shard level and aggregations
>> work on indices level
>
> If they are not expected to work, would it make sense to either:
>
> 1. prevent aggregation/facet requests in conjunction with scroll requests
> (i.e. give an error to the user), or
> 2. simply not execute them?
>
> If neither of those makes sense, would it be better to at least not return
> any aggregation/facet results?
>
>> 3. a scroll request takes resources. The purpose of ClearScrollRequest is
>> to release those resources explicitly. This is indeed a rare situation
>> when you need explicit clearing. The time delay of releasing scrolls
>> implicitly can be controlled by the requests.
>
> Do you mean the keepAlive time? So does the scroll (and its resources)
> always remain for the duration of the keepAlive (since the last request on
> that scroll), regardless of whether the end of the scroll was reached?
>
> I read the following (from the documentation) to imply that reading to the
> end of the scroll had the effect of "aborting" it and therefore cleaning
> up its resources:
>
> "Besides consuming the scroll search until no hits has been returned a
> scroll search can also be aborted by deleting the scroll_id"
>
> So, just to confirm: reading to the end of the results does nothing in
> terms of cleaning up the scroll? It is either the TTL or the
> ClearScrollRequest that brings about the cleanup of resources.
>
> Is there any downside to calling ClearScrollRequest explicitly? (I am
> inclined to call it explicitly when the end of the scroll is reached, in
> order to clean up resources as soon as possible.)
>
>> 4. yes, the scroll id is an encoding of the combined state of all the
>> shards that participate in the scroll. Even if the ID looks as if it has
>> not changed, you should always use the latest reference to the scroll ID
>> in the response, or you may clutter the nodes with unreleased scroll
>> resources.
>
> Thanks for the explanation.
>
>> A null scroll ID is a matter of API design. By using hit length check for
>> 0, you can use the same condition for other queries, so it is convenient
>> and not confusing. Null scroll IDs are always prone to NPEs.
>
> Agreed, it is a matter of API style/design.
> The only issue I have with checking hits.length is that, depending on the
> SearchType, hits.length == 0 does not always mean the end of the results
> (e.g. SearchType.SCAN). It is the lack of consistency that bothers me: it
> requires the code that handles results to be aware of a detail of the
> request.
>
> My case for using scrollId is that the scrollId is already null if no
> scroll is requested. For this reason, (IMO) scrollId == null would be a
> more consistent indicator of "no scrolling required" - or "no further
> scrolling required". It would also reinforce the notion that the user
> should always use/observe the returned scrollId - they would have to.
>
> Cheers,
> -Nick
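For reference, the loop I have in mind looks roughly like this (a sketch
against the 1.x Java API, not tested; client is an initialised Client, resp
is the response of the initial search issued with SearchType.SCAN and
setScroll, and the org.elasticsearch imports are omitted):

    // Consume the scroll to the end, then release it explicitly.
    // Assumes the initial search used SearchType.SCAN, so resp itself
    // carries no hits - only the total and the first scroll id.
    String scrollId = resp.getScrollId();
    try {
        while (true) {
            resp = client.prepareSearchScroll(scrollId)
                    .setScroll(TimeValue.timeValueMinutes(1))  // renew keepAlive per request
                    .execute().actionGet();
            scrollId = resp.getScrollId();                     // always take the latest id
            if (resp.getHits().getHits().length == 0) {
                break;                                         // scroll exhausted
            }
            for (SearchHit hit : resp.getHits().getHits()) {
                // process hit.getId(), hit.getSourceAsString(), ...
            }
        }
    } finally {
        // Release the server-side resources immediately instead of waiting
        // for the keepAlive to expire.
        client.prepareClearScroll().addScrollId(scrollId).execute().actionGet();
    }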
> On Wednesday, 18 June 2014 00:04:06 UTC+1, Jörg Prante wrote:
>>
>> 1. yes
>>
>> 2. facet/aggregations are not very useful while scrolling (I doubt they
>> even work at all) because scrolling works on shard level and aggregations
>> work on indices level
>>
>> 3. a scroll request takes resources. The purpose of ClearScrollRequest is
>> to release those resources explicitly. This is indeed a rare situation
>> when you need explicit clearing. The time delay of releasing scrolls
>> implicitly can be controlled by the requests.
>>
>> 4. yes, the scroll id is an encoding of the combined state of all the
>> shards that participate in the scroll. Even if the ID looks as if it has
>> not changed, you should always use the latest reference to the scroll ID
>> in the response, or you may clutter the nodes with unreleased scroll
>> resources.
>>
>> Scrolling is very different from search, because there is a shard-level
>> machinery that iterates over the Lucene segments and keeps them open. This
>> tends to ramp up lots of server-side resources, which may be long-lived -
>> a challenge for resource management. There is a reaper thread that wakes
>> up from time to time to take care of stray scroll searches. You observed
>> this as a "time delay". Ordinary search actions never keep resources open
>> at shard level.
>>
>> Using scroll search for creating large CSV exports is adequate, because it
>> iterates through the result set doc by doc. But if you replace a
>> full-fledged search that has facets/filters/aggregations/sorting with a
>> scroll search, you will only create large overheads (if it is even
>> possible).
>>
>> A null scroll ID is a matter of API design. By checking for a hit length
>> of 0, you can use the same condition as for other queries, so it is
>> convenient and not confusing. Null scroll IDs are always prone to NPEs.
>>
>> Jörg
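For the doc-by-doc export case Jörg describes, the scan/scroll combination
would be started along these lines (again just a sketch against the 1.x Java
API; the index name and query are placeholders, imports omitted):

    // Start a scan scroll: cheap, unsorted, meant for iterating doc by doc.
    SearchResponse first = client.prepareSearch("myindex")       // placeholder index
            .setSearchType(SearchType.SCAN)
            .setScroll(TimeValue.timeValueMinutes(1))            // keepAlive
            .setQuery(QueryBuilders.matchAllQuery())             // placeholder query
            .setSize(100)                                        // hits per shard per scroll batch
            .execute().actionGet();

    // With SCAN the first response carries no hits, only the total and the
    // scroll id; the hits arrive from the subsequent scroll requests (see
    // the loop sketched earlier in this thread).
    long totalHits = first.getHits().getTotalHits();
    String scrollId = first.getScrollId();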
>> On Tue, Jun 17, 2014 at 7:46 PM, mooky <[email protected]> wrote:
>>
>>> Having hit a bunch of issues using scroll, I thought I had better improve
>>> my understanding of how scroll is supposed to be used (and how it is not
>>> supposed to be used).
>>>
>>> 1. Does it make sense to execute a search request with scroll, but
>>> SearchType != SCAN?
>>>
>>> 2. Does it make sense to execute a search request with scroll, and also
>>> with facets/aggregations?
>>>
>>> 3. What is the difference between scrolling to the end of the results
>>> (i.e. calling until hits.length == 0) and issuing a specific
>>> ClearScrollRequest? It appears to me that the ClearScrollRequest
>>> immediately clears the scroll, whereas there is some time delay before a
>>> scroll is cleaned up after reaching the end of the results. (I can see
>>> this in my tests because the ElasticsearchIntegrationTest fails on
>>> teardown unless I perform an explicit ClearScrollRequest or add a delay
>>> of some number of seconds.) From reading the docs, I am not sure if this
>>> is a bug or expected behaviour.
>>>
>>> 4. Does the scrollId represent the cursor, or the cursor page/iteration
>>> state? I have read documentation/mailing list explanations with words to
>>> the effect of "you must pass the scrollId from the previous response into
>>> the subsequent request", which suggests the id represents some cursor
>>> state - i.e. performing a scroll request with a given scrollId will
>>> always return the same results. My observation, however, is that the
>>> scrollId does not change (i.e. I get back the same scrollId I passed in),
>>> so each scroll request with the same scrollId advances the 'cursor' until
>>> no results are returned. I have also read things on the mailing list
>>> implying that multiple calls could be made in parallel with the same
>>> scrollId to load all the results faster (which would imply the scrollId
>>> is *not* expected to change). So which is correct? :)
>>>
>>> To explain the background for my questions, I have two requirements:
>>>
>>> 1) I get an update event that leads me to go and find items in the index
>>> that need re-indexing. I perform a search on the index, get the ids, load
>>> the original data from the source system(s), reconstruct the document and
>>> index it. This seems to be exactly what SCAN and SCROLL are meant for.
>>> (However, the SCAN search type is different in that it always returns
>>> zero hits from the original search request - only the scroll requests
>>> seem to return hits.)
>>>
>>> 2) The user normally performs a search, and naturally we limit how many
>>> results we serve to the client. Occasionally, however, the user wants to
>>> return all the data for a given search/filter (say, to export to Excel),
>>> so it seems like a good idea to use the scroll rather than paging through
>>> the results using from & size, as we know we will get consistent results
>>> even if documents are being added/removed/updated on the server.
>>> From a functionality perspective, I want to make sure the scrolling
>>> search request is the same as the non-scrolling search request so the
>>> user gets the same results - so from a code perspective, ideally I want
>>> the codepath to be the same (save for adding the scroll keepAlive param).
>>> However, perhaps there are things I do with my normal search (e.g.
>>> aggregations, SearchType.DEFAULT, etc.) that just don't make sense when
>>> scrolling?
>>>
>>> Many thanks.
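On the second requirement (the export-everything case), one way to keep a
single codepath would be to build the request once and only bolt the
scroll-specific settings on for the export path - roughly like this (sketch
only; userQuery, userFilter, exportAll and pageSize are made-up placeholders,
imports omitted):

    // The same builder serves both paths; only the scroll settings differ.
    SearchRequestBuilder builder = client.prepareSearch("myindex")   // placeholder index
            .setQuery(userQuery)                                     // whatever the user searched for
            .setPostFilter(userFilter);                              // 1.x top-level filter

    if (exportAll) {
        // Export path: scan + scroll, no aggregations/facets, no sorting.
        builder.setSearchType(SearchType.SCAN)
               .setScroll(TimeValue.timeValueMinutes(1))
               .setSize(100);                                        // per shard, per batch
    } else {
        // Interactive path: normal paged search, with whatever sorting
        // and aggregations the UI needs.
        builder.setFrom(0)
               .setSize(pageSize)
               .addSort("timestamp", SortOrder.DESC);                // example sort field
    }

    SearchResponse response = builder.execute().actionGet();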
