[ 
https://issues.apache.org/jira/browse/SOLR-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057512#comment-18057512
 ] 

Nazerke Seidan commented on SOLR-18087:
---------------------------------------

Following up on this issue: I did some investigation and am sharing my findings 
below. 

h3. Context
* Solr uses Jetty 12.0.27
* Default Jetty HTTP/2 flow control windows are relatively small: 
   {code}
   jetty.http2.initialSessionRecvWindow=1048576 (1024 KiB) 
   jetty.http2.initialStreamRecvWindow=524288 (512 KiB)
  {code}


h3. Jetty InputStreamResponseListener

In the HTTP/2 stalling case, Jetty's 
[InputStreamResponseListener|https://github.com/jetty/jetty.project/blob/4197998ac76936e76b3f35cd62dcff8b1ad03064/jetty-core/jetty-client/src/main/java/org/eclipse/jetty/client/InputStreamResponseListener.java#L100]
 uses a pull model in its onContent method. When data arrives, the demander 
(which triggers WINDOW_UPDATE) is put into the ChunkCallback queue rather than 
being run immediately, so the session window is only replenished after Solr 
consumes the data. If Solr doesn't drain the InputStream fast enough to run the 
queued demander.run(), the entire connection deadlocks because the session 
window stays at 0.
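
To make this concrete, here is a minimal usage sketch of 
InputStreamResponseListener (package names assume Jetty 12; the endpoint, 
timeout, and the slow-processing stand-in are illustrative, not Solr's actual 
client code). The point is that the pace of the client's read loop is what 
ultimately returns flow-control credit to the server:
{code:java}
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.InputStreamResponseListener;
import org.eclipse.jetty.client.Response;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.transport.HttpClientTransportOverHTTP2;

public class SlowConsumerDemo {
  public static void main(String[] args) throws Exception {
    HttpClient httpClient =
        new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()));
    httpClient.start();

    InputStreamResponseListener listener = new InputStreamResponseListener();
    // Hypothetical endpoint, for illustration only.
    httpClient.newRequest("http://localhost:8983/solr/collection1/export")
        .send(listener);

    Response response = listener.get(30, TimeUnit.SECONDS); // waits for headers only
    System.out.println("HTTP status: " + response.getStatus());

    try (InputStream body = listener.getInputStream()) {
      byte[] buf = new byte[16 * 1024];
      int n;
      while ((n = body.read(buf)) != -1) {
        // Each read hands back a chunk and eventually re-runs the queued
        // demander, which is what replenishes the stream/session windows.
        // If this loop is slow (parsing, GC pauses), the windows stay
        // exhausted and every other stream on the connection waits.
        Thread.sleep(1); // stand-in for slow per-chunk processing
      }
    }
    httpClient.stop();
  }
}
{code}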

There is also a known issue with the InputStreamResponseListener class (an 
infinite wait on read), but it is not the core cause here: [Jetty Issue #7259 
|https://github.com/jetty/jetty.project/issues/7259]

h3. Jetty discussion thread

I raised this in a [Jetty 
discussion|https://github.com/jetty/jetty.project/discussions/14444], and the 
Jetty maintainer confirmed:
* HTTP/1.1 over multiple connections performs better than HTTP/2 over a single 
connection: the difference comes from the DATA frame overhead (a 9-byte frame 
header for every 16 KiB of data) and from the WINDOW_UPDATE notifications sent 
by the receiving peer.
* A tuned HTTP/2 setup should not stall; if it stalls due to flow control 
windows, increase the client receive windows, sizing them from the 
bandwidth-delay product (BDP), as sketched below.
* Fix the client side first so it consumes data quickly (the client may be slow 
because of parsing/GC).
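
To make the BDP sizing concrete, here is a back-of-the-envelope sketch. The 
bandwidth, RTT, and stream-count figures are assumptions for illustration, not 
measurements from this setup; the resulting numbers would map onto the 
solr.http2.* system properties used in the benchmark run further below.
{code:java}
public class BdpSizing {
  public static void main(String[] args) {
    // Assumptions for illustration only: 10 Gbit/s inter-node link, 2 ms RTT,
    // and roughly one large concurrent response per shard.
    double bandwidthBitsPerSec = 10_000_000_000d;
    double rttSeconds = 0.002;
    long concurrentStreams = 12;

    // BDP = bandwidth (bytes/s) * round-trip time; ~2.5 MB with these inputs.
    long bdpBytes = (long) (bandwidthBitsPerSec / 8 * rttSeconds);
    System.out.printf("per-stream window >= %d bytes%n", bdpBytes);
    System.out.printf("session window    >= %d bytes%n", bdpBytes * concurrentStreams);
    // These would feed -Dsolr.http2.initialStreamRecvWindow and
    // -Dsolr.http2.initialSessionRecvWindow respectively.
  }
}
{code}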

h3. Experimentation (StreamingSearch)

Going back to Luke's benchmark, I tweaked some parameters in the 
StreamingSearch class to check whether the Solr client itself spends 
significant time on large payloads (parsing/serialization).
Updated StreamingSearch class: 
* stream.countTuples() → only count the tuples instead of returning the whole 
list via #getTuples() (see the sketch after this list)
* fl=id (drop the other fields)
* SORT=id asc (sort by id only)
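
A minimal sketch of the counting change (countTuples() is the hypothetical 
helper named above; the actual benchmark code may differ). The idea is to 
consume tuples as fast as possible while keeping nothing, so the client does 
no list building and minimal allocation:
{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.TupleStream;

public final class TupleCounter {
  public static long countTuples(TupleStream stream) throws IOException {
    long count = 0;
    try {
      stream.open();
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        count++; // discard the tuple instead of collecting it into a List
      }
    } finally {
      stream.close();
    }
    return count;
  }
}
{code}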

Another point about the stream path: in CloudSolrStream a single thread 
iterates and merges Tuples from each SolrStream. Large payloads plus 
single-threaded parsing can delay consumption and therefore delay 
WINDOW_UPDATEs, which can make HTTP/2 appear “stalled” relative to HTTP/1.1's 
multi-connection behavior.
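
The simplified loop below is not CloudSolrStream's actual implementation, just 
a generic single-threaded k-way merge to illustrate the shape of the problem: 
every iteration reads from exactly one shard, so parsing time on that shard 
delays consumption, and therefore flow-control credit, for all of them.
{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical per-shard cursor, standing in for a parsed shard response.
interface ShardCursor {
  String peekId();  // id of the current record, or null when exhausted
  void advance();   // parse the next record from this shard's stream
}

public final class SingleThreadedMerge {
  public static long merge(List<ShardCursor> shards) {
    PriorityQueue<ShardCursor> heap =
        new PriorityQueue<>(Comparator.comparing(ShardCursor::peekId));
    for (ShardCursor s : shards) {
      if (s.peekId() != null) heap.add(s);
    }
    long emitted = 0;
    while (!heap.isEmpty()) {
      ShardCursor next = heap.poll();
      emitted++;       // "emit" next.peekId()
      next.advance();  // only this shard's stream is read this iteration;
                       // the other shards' buffered bytes sit unconsumed
      if (next.peekId() != null) heap.add(next);
    }
    return emitted;
  }
}
{code}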


First, with no code changes, I just reran the benchmark with increased flow 
control windows: 
{code}
-p nodeCount=2 -p numShards=12 -p numReplicas=2 -p docCount=10000 -p indexThreads=14 -p batchSize=500 -p docSizeBytes=10024 -p numTextFields=25 -jvmArgs -Dsolr.http2.initialStreamRecvWindow=8000000 -jvmArgs -Dsolr.http2.initialSessionRecvWindow=96000000 StreamingSearch
{code}

HTTP/2:
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt  Score   Error  Units
StreamingSearch.stream          500       10000           10024              14            2              2           12               25       false  thrpt    4  2.509 ± 0.473  ops/s
{code}

HTTP/1.1:
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt  Score   Error  Units
StreamingSearch.stream          500       10000           10024              14            2              2           12               25        true  thrpt   10  3.260 ± 0.074  ops/s
{code}

With the StreamingSearch change (smaller payload):

HTTP/2:
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14            2              2           12               25       false  thrpt   20  127.439 ± 7.772  ops/s
{code}

HTTP/1.1:
{code}
Benchmark               (batchSize)  (docCount)  (docSizeBytes)  (indexThreads)  (nodeCount)  (numReplicas)  (numShards)  (numTextFields)  (useHttp1)   Mode  Cnt    Score   Error  Units
StreamingSearch.stream          500       10000           10024              14            2              2           12               25        true  thrpt   20  130.061 ± 2.783  ops/s
{code}

With the smaller payload:
* HTTP/2: 127.439 ± 7.772 ops/s
* HTTP/1.1: 130.061 ± 2.783 ops/s

This suggests the large-payload regression comes largely from Solr's end-to-end 
processing (parsing/serialization/allocation), not only from HTTP/2 transport 
overhead.

Also, *-prof gc* shows the stream benchmark becomes allocation/GC heavy on the 
client side (~22 MB allocated per operation, ~2.7 GB).

I have not yet done server-side benchmarking with a "dumb 
client"/minimal-consumption client to isolate pure transport behavior and 
validate BDP-based window sizing.



> HTTP/2 Struggles With Streaming Large Responses
> -----------------------------------------------
>
>                 Key: SOLR-18087
>                 URL: https://issues.apache.org/jira/browse/SOLR-18087
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Luke Kot-Zaniewski
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: flow-control-stall.log, index-recovery-tests.md, 
> stream-benchmark-results.md
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appear to be some severe regressions after expansion of HTTP/2 client 
> usage since at least 9.8, most notably with the stream handler as well as 
> index recovery. The impact is at the very least slowness and in some cases 
> outright response stalling. The obvious thing these two very different 
> workloads share in common is that they stream large responses. This means, 
> among other things, that they may be more directly impacted by HTTP2's flow 
> control mechanism. More specifically, the response stalling appears to be 
> caused by session window "cannibalization", i.e.  shards 1 and 2's responses 
> occupy the entirety of the session window *but* haven't been consumed yet, 
> and then, say, TupleStream calls next on shard N (because it is at the top of 
> the priority queue) but the server has nowhere to put this response since 
> shards 1 and 2 have exhausted the client buffer.
> In my testing I have tweaked the following parameters:
>  # http1 vs http2 - as stated, http1 seems to be strictly better as in faster 
> and more stable.
>  # shards per node - the greater the number of shards per node the more 
> (large, simultaneous) responses share a single connection during inter-node 
> communication. This has generally resulted in poorer performance.
>  # maxConcurrentStreams - reducing this to, say 1, can effectively circumvent 
> multiplexing. Circumventing multiplexing does seem to improve index recovery 
> in HTTP/2 but this is not a good setting to keep for production use because 
> it is global and affects *everything*, not just recovery or streaming.
>  # initialSessionRecvWindow - This is the amount of buffer the client gets 
> initially for each connection. This gets shared by the many responses that 
> share the multiplexed connection.
>  #  initialStreamRecvWindow - This is the amount of buffer each stream gets 
> initially within a single HTTP/2 session. I've found that when this is too 
> big relative to initialSessionRecvWindow it can lead to stalling because of 
> flow control enforcement
> # Simple vs Buffering Flow Control Strategy - Controls how frequently the 
> client sends a WINDOW_UPDATE frame to signal the server to send more data. 
> "Simple" sends the frame after consuming any amount of bytes while 
> "Buffering" waits until a consumption threshold is met. So far "Simple" has 
> NOT worked reliably for me and probably why the default is "Buffering".
> I’m attaching summaries of my findings, some of which can be reproduced by 
> running the appropriate benchmark in this 
> [branch|https://github.com/kotman12/solr/tree/http2-shenanigans].
>  The stream benchmark results md file includes the command I ran to achieve 
> the result described. 
> Next steps:
> Reproduce this in a pure jetty example. I am beginning to think multiple 
> large responses getting streamed simultaneously between the same client and 
> server may be some kind of edge case in the library or protocol, itself. It may 
> have something to do with how Jetty's InputStreamResponseListener is 
> implemented although according to the docs it _should_ be compatible with 
> HTTP/2. Furthermore, there may be some other levers offered by HTTP/2 which 
> are not yet exposed by the Jetty API.
> On the other hand, we could consider having separate connection pools for 
> HTTP clients that stream large responses. There seems to be at least [some 
> precedent|https://www.akamai.com/site/en/documents/research-paper/domain-sharding-for-faster-http2-in-lossy-cellular-networks.pdf]
>  for doing this.
> > We investigate and develop a new domain-sharding technique that isolates 
> > large downloads on separate TCP connections, while keeping downloads of 
> > small objects on a single connection.
> HTTP/2 seems designed for [bursty, small 
> traffic|https://hpbn.co/http2/?utm_source=chatgpt.com#one-connection-per-origin]
>  which is why flow-control may not impact it as much. Also, if your payload 
> is small relative to your header then HTTP/2's header compression might be a 
> big win for you but in the case of large responses, not as much. 
> > Most HTTP transfers are short and bursty, whereas TCP is optimized for 
> > long-lived, bulk data transfers.


