Patson Luk created SOLR-17143:
---------------------------------

             Summary: Streaming with multiple shards can trigger unexpected 
IdleTimeout
                 Key: SOLR-17143
                 URL: https://issues.apache.org/jira/browse/SOLR-17143
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 9.4.1
            Reporter: Patson Luk


With the newly submitted test case, we reproduced an issue with streaming that 
we first observed in our production cloud environment. 

The test case uses a collection of 2 shards into which 20k docs are indexed. 
10k docs have ids with routing prefix `a`, while the other 10k use prefix `c`. 
Each prefix hashes to a different shard, producing 2 shards of 10k docs each.

Now, if we stream sorted by id, both shards send back some data initially; 
however, only one shard (the one hosting prefix `a`) sees continued traffic due 
to the sorted iteration. The other shard eventually throws an {{IdleTimeout}} 
because its stream sits pending with no network activity.
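The effect can be illustrated with a small stand-alone simulation (a toy model of the sorted merge, not actual Solr code; the class and method names below are made up): because every `a!...` id sorts before every `c!...` id, the merge reads exclusively from the `a` shard until it is drained, and the `c` shard goes without reads for roughly one merge step per doc on the `a` shard.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ShardIdleDemo {
    // Simulate a merge-sorted stream over two shards whose docs carry
    // routing prefixes "a" and "c". Returns the longest stretch of merge
    // steps during which the "c" shard received no read at all.
    public static long maxIdleGap(int docsPerShard) {
        Deque<String> shardA = new ArrayDeque<>();
        Deque<String> shardC = new ArrayDeque<>();
        for (int i = 0; i < docsPerShard; i++) {
            shardA.add(String.format("a!%05d", i));
            shardC.add(String.format("c!%05d", i));
        }
        long step = 0, lastReadC = 0, maxGap = 0;
        while (!shardA.isEmpty() || !shardC.isEmpty()) {
            step++;
            String a = shardA.peek(), c = shardC.peek();
            if (c == null || (a != null && a.compareTo(c) < 0)) {
                shardA.poll();                       // merge reads from shard A
            } else {
                maxGap = Math.max(maxGap, step - lastReadC);
                lastReadC = step;
                shardC.poll();                       // merge reads from shard C
            }
        }
        return maxGap;
    }

    public static void main(String[] args) {
        // With 10k docs per shard, shard C waits ~10k merge steps for its
        // first read after the initial batch -- long enough to exceed a
        // wall-clock idle timeout on a slow-enough consumer.
        System.out.println(maxIdleGap(10_000));
    }
}
```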

If we change the test case `SHARD_COUNT` from 2 to 1, then the case runs fine. 

In our environment, the Jetty HTTP connector idle timeout is 120 secs, yet we 
still run into this occasionally. The client does consume the data at a 
reasonable rate; however, with up to 1024 shards per collection, it's quite 
easy for some shards to have no data streamed within 120 secs, hence triggering 
the mentioned timeout.
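For reference, this is where the timeout is configured in a stock install: the connector's {{idleTimeout}} in `server/etc/jetty-http.xml` (fragment reproduced from memory of Solr 9.x defaults; verify the property name against your install):

```xml
<!-- Assumed excerpt of server/etc/jetty-http.xml: the ServerConnector's
     idleTimeout, overridable via the solr.jetty.http.idleTimeout sysprop,
     defaults to 120000 ms (120 secs). -->
<Set name="idleTimeout">
  <Property name="solr.jetty.http.idleTimeout" default="120000"/>
</Set>
```

Raising this value only widens the window; with enough shards and a sorted stream, some shard can still exceed any fixed timeout.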


We assume such an issue with streaming is not uncommon for distributed systems, 
and we are wondering what could be done to fix or mitigate it. 

Several ideas that we have:
1. If possible, we might want to stream per shard instead of per collection. 
However, there are cases where we do want to stream the whole collection in 
sorted order.
2. Is there any low-level "keep-alive" already built in? I couldn't find any so 
far :)
3. Keep the stream alive by pushing a small amount of dummy data from the 
aggregator (the Solr node that distributes the stream request as /export to 
other nodes). This got very hacky and still isn't working; I didn't dig too 
deep, as I wish to surface the issue to the Solr community and gather some 
thoughts first!
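For discussion, idea 3 could look something like the sketch below (not the attempted patch, just an illustration of the concept; the class name and wiring are hypothetical): the aggregator wraps each per-shard response stream and, when a periodic check finds the stream idle, writes a single whitespace byte, which is insignificant between JSON tokens but resets the connector's idle clock.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: wraps a response stream so that a periodic task on the
// aggregator can inject a benign padding byte whenever the stream has been
// idle longer than the threshold, keeping the connection active.
class KeepAliveOutputStream extends FilterOutputStream {
    private long lastWriteNanos = System.nanoTime();
    private final long idleThresholdNanos;

    KeepAliveOutputStream(OutputStream out, long idleThresholdMillis) {
        super(out);
        this.idleThresholdNanos = idleThresholdMillis * 1_000_000L;
    }

    @Override
    public synchronized void write(int b) throws IOException {
        out.write(b);                      // real payload also resets the clock
        lastWriteNanos = System.nanoTime();
    }

    // Intended to be called periodically, e.g. from a scheduled task on the
    // aggregator. Returns true if a padding byte was written.
    synchronized boolean heartbeatIfIdle() throws IOException {
        if (System.nanoTime() - lastWriteNanos >= idleThresholdNanos) {
            out.write(' ');                // whitespace is ignored by the parser
            out.flush();
            lastWriteNanos = System.nanoTime();
            return true;
        }
        return false;
    }
}
```

One open question with this approach is format-dependence: a raw space is safe between JSON values but not inside every wire format the stream handlers support, which may be part of why the hack has been fragile.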





--
This message was sent by Atlassian Jira
(v8.20.10#820010)