kotman12 commented on code in PR #4053:
URL: https://github.com/apache/solr/pull/4053#discussion_r2723442052
##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,26 @@ This feature uses a stream sorting technique that begins to send records within
 The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.
+== Comparison with Cursors
+
+The `/export` handler offers several advantages over xref:pagination-of-results.adoc#fetching-a-large-number-of-sorted-results-cursors[cursor-based pagination] for streaming large result sets.
+
+With cursors, the query is re-executed for each page of results.
+In contrast, `/export` runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over.
+Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors.
+The trade-off is that IndexReaders are kept around for longer periods of time.
+
+Another advantage of `/export` is significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size.
+With cursors, you typically need to set the `rows` parameter to a high value (e.g., 100,000) to achieve decent throughput.
+However, this creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits.
+Only after receiving and decoding this large payload can the client request the next batch, but in the interim Solr sits idle on this request.
+With the `/export` handler, these steps are decoupled - Solr can continue sorting and decoding/encoding documents while waiting for more demand from the client.
+
+The advantage of cursors is flexibility.
+A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.

Review Comment:
I do feel it is implied that an export stream doesn't need to be fully consumed. For example, what if the client crashes? It would be unreasonable to implement export in a way that can't handle a crashing client. I suppose one could mention `close`, but I'm not sure if this is the right place.
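
To illustrate the distinction under discussion, here is a minimal Python sketch, not Solr code. It assumes nothing about the real handlers: `DOCS`, `fake_cursor_page`, `iter_cursor`, `fake_export_stream`, and `released` are all hypothetical stand-ins that run in memory. Only the `cursorMark` / `nextCursorMark` convention (start at `*`, stop when the returned mark equals the one sent) mirrors documented cursor behaviour. The sketch shows a cursor mark being saved and resumed later, and a stream being abandoned part-way with an explicit close.

```python
from contextlib import closing
from itertools import islice

# Hypothetical in-memory "index", already sorted by id.
DOCS = [{"id": f"doc{i}"} for i in range(7)]

# Records cleanup events so the example can show that close() triggers them.
released = []


def fake_cursor_page(cursor_mark, rows):
    """Stand-in for one cursor request; the mark here is just an offset."""
    start = 0 if cursor_mark == "*" else int(cursor_mark)
    docs = DOCS[start:start + rows]
    # When nothing is left, echo the mark back, as cursors do at the end.
    next_mark = str(start + len(docs)) if docs else cursor_mark
    return {"response": {"docs": docs}, "nextCursorMark": next_mark}


def iter_cursor(fetch_page, cursor_mark="*", rows=3):
    """Yield documents page by page until the mark stops changing."""
    while True:
        page = fetch_page(cursor_mark, rows)
        yield from page["response"]["docs"]
        next_mark = page["nextCursorMark"]
        if next_mark == cursor_mark:
            return
        cursor_mark = next_mark


def fake_export_stream():
    """Stand-in for an open stream; the finally block models its cleanup."""
    try:
        for doc in DOCS:
            yield doc
    finally:
        released.append(True)


# Cursor: read one page, keep the mark, then resume in a "later session".
first_page = fake_cursor_page("*", 3)
saved_mark = first_page["nextCursorMark"]
resumed = [d["id"] for d in iter_cursor(fake_cursor_page, saved_mark, 3)]

# Stream: take two documents, then abandon the rest by closing the stream.
with closing(fake_export_stream()) as stream:
    partial = [d["id"] for d in islice(stream, 2)]

print(resumed)
print(partial)
print(released)
```

In this sketch the saved mark is the only state needed to continue the cursor, while the stream has no such handle: once closed, the remaining documents can only be obtained by opening a new stream from the start. How the real `/export` handler reacts to an abandoned connection is not modelled here.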
