Further to (2): would it be an improvement to have a different kind of request for a scrolling search? That way the API could exclude options that don't make sense for scrolling (e.g. aggregations, facets, etc.).
On Wednesday, 18 June 2014 10:28:06 UTC+1, mooky wrote:
>
> Many thanks, Jörg. Further questions/comments inline:
>
>> 1. yes
>
> Thanks.
>
>> 2. facet/aggregations are not very useful while scrolling (I doubt they
>> even work at all) because scrolling works on shard level and aggregations
>> work on indices level
>
> If they are not expected to work, would it make sense to either:
>
> 1. prevent aggregation/facet requests in conjunction with scroll requests
> (i.e. give an error to the user), or
> 2. simply not execute them?
>
> If neither of those makes sense, would it be better to at least not return
> any aggregation/facet results?
>
>> 3. a scroll request takes resources. The purpose of ClearScrollRequest is
>> to release those resources explicitly. This is indeed a rare situation
>> when you need explicit clearing. The time delay of releasing scrolls
>> implicitly can be controlled by the requests.
>
> Do you mean the keepAlive time? So does the scroll (and its resources)
> always remain for the duration of the keepAlive (since the last request on
> that scroll), regardless of whether the end of the scroll was reached?
>
> I read the following (from the documentation) to imply that reading to the
> end of the scroll had the effect of "aborting" it and therefore cleaning
> up its resources:
>
> "Besides consuming the scroll search until no hits has been returned a
> scroll search can also be aborted by deleting the scroll_id"
>
> So, just to confirm: reading to the end of the results does nothing in
> terms of cleaning up the scroll? It is either the TTL or the
> ClearScrollRequest that brings about the cleanup of resources.
>
> Is there any downside to calling ClearScrollRequest explicitly? (I am
> inclined to call it explicitly when the end of the scroll is reached, in
> order to clean up resources as soon as possible.)
>
>> 4. yes, the scroll id is an encoding of the combined state of all the
>> shards that participate in the scroll. Even if the ID looks as if it has
>> not changed, you should always use the latest reference to the scroll ID
>> in the response, or you may clutter the nodes with unreleased scroll
>> resources.
>
> Thanks for the explanation.
>
>> A null scroll ID is a matter of API design. By using hit length check for
>> 0, you can use the same condition for other queries, so it is convenient
>> and not confusing. Null scroll IDs are always prone to NPEs.
>
> Agreed, it is a matter of API style/design.
> The only issue I have with checking hits.length is that, depending on the
> SearchType, hits.length == 0 does not always mean the end of the results
> (e.g. SearchType.SCAN). It is the lack of consistency that bothers me: it
> requires the code that handles results to be aware of a detail of the
> request.
>
> My case for using scrollId is that the scrollId is already null if no
> scroll is requested. For this reason, (IMO) scrollId == null would be a
> more consistent indicator of "no scrolling required" - or "no further
> scrolling required". It would also reinforce the notion that the user
> should always use/observe the returned scrollId - they would have to.
>
> Cheers,
> -Nick
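For reference, the loop I have in mind looks roughly like this (a sketch
against the 1.x Java API, not tested; client is an initialised Client, resp
is the response of the initial search issued with SearchType.SCAN and
setScroll, and the org.elasticsearch imports are omitted):

    // Consume the scroll to the end, then release it explicitly.
    // Assumes the initial search used SearchType.SCAN, so resp itself
    // carries no hits - only the total and the first scroll id.
    String scrollId = resp.getScrollId();
    try {
        while (true) {
            resp = client.prepareSearchScroll(scrollId)
                    .setScroll(TimeValue.timeValueMinutes(1))  // renew keepAlive per request
                    .execute().actionGet();
            scrollId = resp.getScrollId();                     // always take the latest id
            if (resp.getHits().getHits().length == 0) {
                break;                                         // scroll exhausted
            }
            for (SearchHit hit : resp.getHits().getHits()) {
                // process hit.getId(), hit.getSourceAsString(), ...
            }
        }
    } finally {
        // Release the server-side resources immediately instead of waiting
        // for the keepAlive to expire.
        client.prepareClearScroll().addScrollId(scrollId).execute().actionGet();
    }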
> On Wednesday, 18 June 2014 00:04:06 UTC+1, Jörg Prante wrote:
>>
>> 1. yes
>>
>> 2. facet/aggregations are not very useful while scrolling (I doubt they
>> even work at all) because scrolling works on shard level and aggregations
>> work on indices level
>>
>> 3. a scroll request takes resources. The purpose of ClearScrollRequest is
>> to release those resources explicitly. This is indeed a rare situation
>> when you need explicit clearing. The time delay of releasing scrolls
>> implicitly can be controlled by the requests.
>>
>> 4. yes, the scroll id is an encoding of the combined state of all the
>> shards that participate in the scroll. Even if the ID looks as if it has
>> not changed, you should always use the latest reference to the scroll ID
>> in the response, or you may clutter the nodes with unreleased scroll
>> resources.
>>
>> Scrolling is very different from search, because there is a shard-level
>> machinery that iterates over the Lucene segments and keeps them open. This
>> tends to ramp up lots of server-side resources, which may be long-lived -
>> a challenge for resource management. There is a reaper thread that wakes
>> up from time to time to take care of stray scroll searches. You observed
>> this as a "time delay". Ordinary search actions never keep resources open
>> at shard level.
>>
>> Using scroll search for creating large CSV exports is adequate, because it
>> iterates through the result set doc by doc. But if you replace a
>> full-fledged search that has facets/filters/aggregations/sorting with a
>> scroll search, you will only create large overheads (if it is even
>> possible).
>>
>> A null scroll ID is a matter of API design. By checking for a hit length
>> of 0, you can use the same condition as for other queries, so it is
>> convenient and not confusing. Null scroll IDs are always prone to NPEs.
>>
>> Jörg
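For the doc-by-doc export case Jörg describes, the scan/scroll combination
would be started along these lines (again just a sketch against the 1.x Java
API; the index name and query are placeholders, imports omitted):

    // Start a scan scroll: cheap, unsorted, meant for iterating doc by doc.
    SearchResponse first = client.prepareSearch("myindex")       // placeholder index
            .setSearchType(SearchType.SCAN)
            .setScroll(TimeValue.timeValueMinutes(1))            // keepAlive
            .setQuery(QueryBuilders.matchAllQuery())             // placeholder query
            .setSize(100)                                        // hits per shard per scroll batch
            .execute().actionGet();

    // With SCAN the first response carries no hits, only the total and the
    // scroll id; the hits arrive from the subsequent scroll requests (see
    // the loop sketched earlier in this thread).
    long totalHits = first.getHits().getTotalHits();
    String scrollId = first.getScrollId();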
>> On Tue, Jun 17, 2014 at 7:46 PM, mooky <[email protected]> wrote:
>>
>>> Having hit a bunch of issues using scroll, I thought I had better improve
>>> my understanding of how scroll is supposed to be used (and how it is not
>>> supposed to be used).
>>>
>>> 1. Does it make sense to execute a search request with scroll, but
>>> SearchType != SCAN?
>>>
>>> 2. Does it make sense to execute a search request with scroll, and also
>>> with facets/aggregations?
>>>
>>> 3. What is the difference between scrolling to the end of the results
>>> (i.e. calling until hits.length == 0) and issuing a specific
>>> ClearScrollRequest? It appears to me that the ClearScrollRequest
>>> immediately clears the scroll, whereas there is some time delay before a
>>> scroll is cleaned up after reaching the end of the results. (I can see
>>> this in my tests because the ElasticsearchIntegrationTest fails on
>>> teardown unless I perform an explicit ClearScrollRequest or add a delay
>>> of some number of seconds.) From reading the docs, I am not sure if this
>>> is a bug or expected behaviour.
>>>
>>> 4. Does the scrollId represent the cursor, or the cursor page/iteration
>>> state? I have read documentation/mailing list explanations with words to
>>> the effect of "you must pass the scrollId from the previous response into
>>> the subsequent request", which suggests the id represents some cursor
>>> state - i.e. performing a scroll request with a given scrollId will
>>> always return the same results. My observation, however, is that the
>>> scrollId does not change (i.e. I get back the same scrollId I passed in),
>>> so each scroll request with the same scrollId advances the 'cursor' until
>>> no results are returned. I have also read things on the mailing list
>>> implying that multiple calls could be made in parallel with the same
>>> scrollId to load all the results faster (which would imply the scrollId
>>> is *not* expected to change). So which is correct? :)
>>>
>>> To explain the background for my questions, I have two requirements:
>>>
>>> 1) I get an update event that leads me to go and find items in the index
>>> that need re-indexing. I perform a search on the index, get the ids, load
>>> the original data from the source system(s), reconstruct the document and
>>> index it. This seems to be exactly what SCAN and SCROLL are meant for.
>>> (However, the SCAN search type is different in that it always returns
>>> zero hits from the original search request - only the scroll requests
>>> seem to return hits.)
>>>
>>> 2) The user normally performs a search, and naturally we limit how many
>>> results we serve to the client. Occasionally, however, the user wants to
>>> return all the data for a given search/filter (say, to export to Excel),
>>> so it seems like a good idea to use the scroll rather than paging through
>>> the results using from & size, as we know we will get consistent results
>>> even if documents are being added/removed/updated on the server.
>>> From a functionality perspective, I want to make sure the scrolling
>>> search request is the same as the non-scrolling search request so the
>>> user gets the same results - so from a code perspective, ideally I want
>>> the codepath to be the same (save for adding the scroll keepAlive param).
>>> However, perhaps there are things I do with my normal search (e.g.
>>> aggregations, SearchType.DEFAULT, etc.) that just don't make sense when
>>> scrolling?
>>>
>>> Many thanks.
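On the second requirement (the export-everything case), one way to keep a
single codepath would be to build the request once and only bolt the
scroll-specific settings on for the export path - roughly like this (sketch
only; userQuery, userFilter, exportAll and pageSize are made-up placeholders,
imports omitted):

    // The same builder serves both paths; only the scroll settings differ.
    SearchRequestBuilder builder = client.prepareSearch("myindex")   // placeholder index
            .setQuery(userQuery)                                     // whatever the user searched for
            .setPostFilter(userFilter);                              // 1.x top-level filter

    if (exportAll) {
        // Export path: scan + scroll, no aggregations/facets, no sorting.
        builder.setSearchType(SearchType.SCAN)
               .setScroll(TimeValue.timeValueMinutes(1))
               .setSize(100);                                        // per shard, per batch
    } else {
        // Interactive path: normal paged search, with whatever sorting
        // and aggregations the UI needs.
        builder.setFrom(0)
               .setSize(pageSize)
               .addSort("timestamp", SortOrder.DESC);                // example sort field
    }

    SearchResponse response = builder.execute().actionGet();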
