[
https://issues.apache.org/jira/browse/SOLR-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057512#comment-18057512
]
Nazerke Seidan edited comment on SOLR-18087 at 2/10/26 6:50 AM:
----------------------------------------------------------------
Following up on this issue, I investigated it and am sharing some findings
below.
h3. Context
* Solr uses Jetty 12.0.27
* Default Jetty HTTP/2 flow control windows are relatively small:
{code}
jetty.http2.initialSessionRecvWindow=1048576 (1024 KiB)
jetty.http2.initialStreamRecvWindow=524288 (512 KiB)
{code}
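For reference, the same two windows exist on the Jetty client side. Below is a minimal sketch (illustrative values, and not Solr's actual Http2SolrClient wiring) of how they can be raised programmatically on a raw Jetty HTTP/2 client:
{code:java}
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.transport.HttpClientTransportOverHTTP2;

public class TunedHttp2Client {
    public static HttpClient create() throws Exception {
        HTTP2Client http2 = new HTTP2Client();
        // Raise the receive windows above the small defaults listed above.
        // The values are illustrative; BDP-based sizing is discussed below.
        http2.setInitialSessionRecvWindow(16 * 1024 * 1024); // per-connection budget
        http2.setInitialStreamRecvWindow(4 * 1024 * 1024);   // per-stream budget
        HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(http2));
        client.start();
        return client;
    }
}
{code}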
h3. Jetty InputStreamResponseListener
For the HTTP/2 stalling case: Jetty's
[InputStreamResponseListener|https://github.com/jetty/jetty.project/blob/4197998ac76936e76b3f35cd62dcff8b1ad03064/jetty-core/jetty-client/src/main/java/org/eclipse/jetty/client/InputStreamResponseListener.java#L100]
uses a pull model in its onContent method. When data arrives, the demander
(which triggers WINDOW_UPDATE) is enqueued in the ChunkCallback queue rather
than executed immediately. The session window only grows back after Solr
consumes the data. If Solr doesn't read the InputStream fast enough to run the
queued demander.run(), the entire connection deadlocks once the session window
hits 0.
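To illustrate that pull model, here is a minimal consumer built directly on InputStreamResponseListener (the URL is hypothetical); the speed of the read loop is what governs how quickly flow-control credit is returned:
{code:java}
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.InputStreamResponseListener;
import org.eclipse.jetty.client.Response;

public class PullModelDemo {
    // Drain a large streamed response. Each read() consumes a queued chunk and
    // re-runs the demander, which is what eventually lets WINDOW_UPDATE frames
    // go back to the server. A slow loop here keeps the windows exhausted.
    static long drain(HttpClient client, String url) throws Exception {
        InputStreamResponseListener listener = new InputStreamResponseListener();
        client.newRequest(url).send(listener);
        Response response = listener.get(10, TimeUnit.SECONDS); // waits for headers only
        if (response.getStatus() != 200) {
            throw new IllegalStateException("HTTP " + response.getStatus());
        }
        long total = 0;
        try (InputStream in = listener.getInputStream()) {
            byte[] buf = new byte[16 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
{code}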
There is also a known issue against InputStreamResponseListener (it waits
indefinitely on read), but it is not the root cause here:
[Jetty Issue #7259|https://github.com/jetty/jetty.project/issues/7259]
h3. Jetty discussion thread
I raised this in a [Jetty
discussion|https://github.com/jetty/jetty.project/discussions/14444], pointing
out this issue, and the Jetty maintainer confirmed:
* HTTP/1.1 over multiple connections can outperform HTTP/2 over a single
connection: this comes from the DATA frame overhead (9 bytes of framing for
every 16 KiB of payload) and from the WINDOW_UPDATE notifications sent by the
receiving peer.
* A properly tuned HTTP/2 setup should not stall; if it stalls on flow control
windows, increase the client receive windows, sizing them with the
bandwidth-delay product (BDP) (rough example below).
* Fix the client side first so that it consumes data quickly (the client may be
slow due to parsing/GC).
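For a rough sense of what BDP-based sizing means here, a small worked example with assumed (not measured) numbers:
{code:java}
public class BdpSizing {
    public static void main(String[] args) {
        // Assumed numbers for illustration only: a 10 Gbit/s link and ~2 ms RTT.
        long bandwidthBytesPerSec = 10_000_000_000L / 8;            // 10 Gbit/s in bytes/s
        double rttSeconds = 0.002;                                  // 2 ms round-trip time
        long bdpBytes = (long) (bandwidthBytesPerSec * rttSeconds); // 2_500_000 bytes (~2.4 MiB)
        System.out.println("Per-stream receive window should be >= " + bdpBytes + " bytes");
        // With many responses multiplexed on one connection, the session window
        // should cover several such streams at once, hence much larger values.
    }
}
{code}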
h3. Experimentation (StreamingSearch)
Going back to Luke's benchmark, I tweaked a few parameters in the
StreamingSearch class to check whether the Solr client itself spends
significant time on large payloads (parsing/serialization).
Changes to the StreamingSearch class (see the sketch below):
* stream.countTuples() → just count the tuples instead of returning the whole
list via #getTuples()
* fl=id (drop the other fields)
* sort=id asc (sort by id only)
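A minimal sketch of the counting change (the helper name countTuples comes from the modified benchmark; the body below is my reconstruction, not the exact code):
{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.TupleStream;

public class TupleCounter {
    // Drain the stream and count tuples instead of materializing the whole list,
    // so the client stays a fast consumer and allocation stays flat.
    static long countTuples(TupleStream stream) throws IOException {
        long count = 0;
        stream.open();
        try {
            for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
                count++;
            }
        } finally {
            stream.close();
        }
        return count;
    }
}
{code}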
Another point about streaming: in CloudSolrStream a single thread iterates over
and merges Tuples from each SolrStream. Large payloads plus single-threaded
parsing can delay consumption, and therefore delay WINDOW_UPDATEs, which can
make HTTP/2 appear “stalled” relative to HTTP/1.1's multi-connection behavior.
First, with no code change, I reran the benchmark with increased flow control windows:
{code}
-p nodeCount=2 -p numShards=12 -p numReplicas=2 -p docCount=10000
-p indexThreads=14 -p batchSize=500 -p docSizeBytes=10024 -p numTextFields=25
-jvmArgs -Dsolr.http2.initialStreamRecvWindow=8000000
-jvmArgs -Dsolr.http2.initialSessionRecvWindow=96000000
StreamingSearch
{code}
HTTP/2
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14           2             2           12               25       false  thrpt    4    2.509 ± 0.473  ops/s
{code}
HTTP/1.1
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14           2             2           12               25        true  thrpt   10    3.260 ± 0.074  ops/s
{code}
With the StreamingSearch change (smaller payload):
HTTP/2
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14           2             2           12               25       false  thrpt   20  127.439 ± 7.772  ops/s
{code}
HTTP/1.1
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14           2             2           12               25        true  thrpt   20  130.061 ± 2.783  ops/s
{code}
With the smaller payload:
HTTP/2: 127.439 ± 7.772 ops/s
HTTP/1.1: 130.061 ± 2.783 ops/s
Also, *-prof gc* shows the stream benchmark becomes allocation/GC heavy on
the client side (~22 MB allocated per operation, ~2.7 GB).
This suggests the large-payload regression comes largely from Solr's end-to-end
processing (parsing/serialization/allocation), not only from HTTP/2 transport
overhead.
I have not yet done server-side benchmarking with a "dumb"/minimal-consumption
client to isolate pure transport behavior and validate BDP-based window sizing.
> HTTP/2 Struggles With Streaming Large Responses
> -----------------------------------------------
>
> Key: SOLR-18087
> URL: https://issues.apache.org/jira/browse/SOLR-18087
> Project: Solr
> Issue Type: Bug
> Reporter: Luke Kot-Zaniewski
> Priority: Major
> Labels: pull-request-available
> Attachments: flow-control-stall.log, index-recovery-tests.md,
> stream-benchmark-results.md
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There appear to be some severe regressions after expansion of HTTP/2 client
> usage since at least 9.8, most notably with the stream handler as well as
> index recovery. The impact is at the very least slowness and in some cases
> outright response stalling. The obvious thing these two very different
> workloads share in common is that they stream large responses. This means,
> among other things, that they may be more directly impacted by HTTP2's flow
> control mechanism. More specifically, the response stalling appears to be
> caused by session window "cannibalization", i.e. shards 1 and 2's responses
> occupy the entirety of the session window *but* haven't been consumed yet,
> and then, say, TupleStream calls next on shard N (because it is at the top of
> the priority queue) but the server has nowhere to put this response since
> shards 1 and 2 have exhausted the client buffer.
> In my testing I have tweaked the following parameters:
> # http1 vs http2 - as stated, http1 seems to be strictly better as in faster
> and more stable.
> # shards per node - the greater the number of shards per node the more
> (large, simultaneous) responses share a single connection during inter-node
> communication. This has generally resulted in poorer performance.
> # maxConcurrentStreams - reducing this to, say 1, can effectively circumvent
> multiplexing. Circumventing multiplexing does seem to improve index recovery
> in HTTP/2 but this is not a good setting to keep for production use because
> it is global and affects *everything*, not just recovery or streaming.
> # initialSessionRecvWindow - This is the amount of buffer the client gets
> initially for each connection. This gets shared by the many responses that
> share the multiplexed connection.
> # initialStreamRecvWindow - This is the amount of buffer each stream gets
> initially within a single HTTP/2 session. I've found that when this is too
> big relative to initialSessionRecvWindow it can lead to stalling because of
> flow control enforcement
> # Simple vs Buffering Flow Control Strategy - Controls how frequently the
> client sends a WINDOW_UPDATE frame to signal the server to send more data.
> "Simple" sends the frame after consuming any amount of bytes while
> "Buffering" waits until a consumption threshold is met. So far "Simple" has
> NOT worked reliably for me, which is probably why the default is "Buffering".
> I’m attaching summaries of my findings, some of which can be reproduced by
> running the appropriate benchmark in this
> [branch|https://github.com/kotman12/solr/tree/http2-shenanigans].
> The stream benchmark results md file includes the command I ran to achieve
> the result described.
> Next steps:
> Reproduce this in a pure Jetty example. I am beginning to think that multiple
> large responses streamed simultaneously between the same client and server
> may be some kind of edge case in the library or the protocol itself. It may
> have something to do with how Jetty's InputStreamResponseListener is
> implemented although according to the docs it _should_ be compatible with
> HTTP/2. Furthermore, there may be some other levers offered by HTTP/2 which
> are not yet exposed by the Jetty API.
> On the other hand, we could consider having separate connection pools for
> HTTP clients that stream large responses. There seems to be at least [some
> precedent|https://www.akamai.com/site/en/documents/research-paper/domain-sharding-for-faster-http2-in-lossy-cellular-networks.pdf]
> for doing this.
> > We investigate and develop a new domain-sharding technique that isolates
> > large downloads on separate TCP connections, while keeping downloads of
> > small objects on a single connection.
> HTTP/2 seems designed for [bursty, small
> traffic|https://hpbn.co/http2/#one-connection-per-origin]
> which is why flow-control may not impact it as much. Also, if your payload
> is small relative to your header then HTTP/2's header compression might be a
> big win for you but in the case of large responses, not as much.
> > Most HTTP transfers are short and bursty, whereas TCP is optimized for
> > long-lived, bulk data transfers.