Patson Luk created SOLR-17143:
---------------------------------
Summary: Streaming with multiple shards can trigger an unexpected
IdleTimeout
Key: SOLR-17143
URL: https://issues.apache.org/jira/browse/SOLR-17143
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrCloud
Affects Versions: 9.4.1
Reporter: Patson Luk
With the newly submitted test case, we reproduced a streaming issue seen in our
production cloud environment.
The test case uses a collection with 2 shards into which 20k docs are indexed.
10k docs have ids with the routing prefix `a`, and the other 10k have the
prefix `c`. The two prefixes hash to different shards, producing 2 shards of
10k docs each.
Now, if we stream sorted by id, both shards send back some data initially.
However, only one shard (the one hosting prefix `a`) sees continued traffic
because of the sorted iteration; the other shard eventually throws
{{IdleTimeout}} because its stream is left pending without network activity.
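To make the access pattern concrete, here is a minimal sketch in plain Python (not Solr code) of a sorted merge over two shard streams. The names `shard_stream` and `merged_pull_order` are hypothetical, the id format is made up, and `heapq.merge` only stands in for whatever merge the aggregator really performs. It records which shard is read on each step.

```python
import heapq


def shard_stream(name, ids, pulls):
    """Yield ids from one simulated shard, logging every read in `pulls`."""
    for doc_id in ids:
        pulls.append(name)  # one entry per document actually read from this shard
        yield doc_id


def merged_pull_order(docs_per_shard=5):
    """Merge two sorted shard streams by id and return (merged ids, read log)."""
    pulls = []
    # Hypothetical ids: prefix `a` lives on one shard, prefix `c` on the other.
    shard_a = shard_stream("a", [f"a!{i:05d}" for i in range(docs_per_shard)], pulls)
    shard_c = shard_stream("c", [f"c!{i:05d}" for i in range(docs_per_shard)], pulls)
    merged = list(heapq.merge(shard_a, shard_c))
    return merged, pulls


merged, pulls = merged_pull_order()
print(pulls)
```

In this sketch each shard is read once up front, then shard `c` is not read again until every `a` document has been consumed. With 10k docs per shard and a slower consumer, that quiet period is where the idle timeout can fire.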
If we change the test case `SHARD_COUNT` from 2 to 1, then the case runs fine.
In our environment, the Jetty HTTP connector timeout is set to 120 secs, yet we
still run into this occasionally. The client consumes the data at a reasonable
rate, but with up to 1024 shards per collection it is quite easy for some
shards to have no data streamed within 120 secs, which triggers the timeout
mentioned above.
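As a rough back-of-the-envelope check, the sketch below estimates how long a shard's stream sits idle while the client works through documents that sort ahead of it. The function names and the consumption rates are hypothetical; only the 120 sec timeout and the 10k docs per shard come from the description above.

```python
def idle_gap_seconds(docs_ahead, docs_per_second):
    """Time a shard waits while `docs_ahead` documents from other shards are consumed."""
    return docs_ahead / docs_per_second


def exceeds_idle_timeout(docs_ahead, docs_per_second, idle_timeout_seconds=120):
    """True if the estimated quiet period is longer than the connector idle timeout."""
    return idle_gap_seconds(docs_ahead, docs_per_second) > idle_timeout_seconds


# Hypothetical rates: a slow consumer vs. a fast one, with 10k docs sorting first.
print(exceeds_idle_timeout(10_000, 50))     # 200 s of silence
print(exceeds_idle_timeout(10_000, 1_000))  # 10 s of silence
```

This ignores buffering and batch sizes, so it is only meant to show why a reasonable consumption rate can still leave some shards silent for longer than the timeout.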
We assume this kind of streaming issue is not uncommon in distributed systems,
and we are wondering what could be done to fix or mitigate it.
Several ideas that we have:
1. If possible, we might want to stream per shard instead of per collection.
However, there are cases where we do want to stream the whole collection in
sorted order.
2. Is there any low-level "keep-alive" already built in? I couldn't find any
so far :)
3. Keep the stream alive by pushing a small amount of dummy data from the
aggregator (the Solr node that distributes the stream request as /export to
other nodes). Our attempt got very hacky and is still not working. I didn't dig
too deep, as I wish to surface this issue to the Solr community and gather some
thoughts first!
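To illustrate the keep-alive idea in item 3, here is a minimal, self-contained sketch in plain Python. It is not our attempted patch and not a Solr API: `KEEPALIVE`, `with_keepalives`, and `strip_keepalives` are hypothetical names, and which side injects the filler and at what layer are left open. It works on simulated timestamps so the behaviour is deterministic.

```python
KEEPALIVE = object()  # hypothetical filler marker, not a Solr construct


def with_keepalives(events, interval):
    """Insert KEEPALIVE entries so no gap between emissions exceeds `interval`.

    `events` is an iterable of (timestamp_seconds, payload) in time order.
    """
    last = None
    for ts, payload in events:
        if last is not None:
            t = last + interval
            while t < ts:  # fill the quiet period with filler entries
                yield (t, KEEPALIVE)
                t += interval
        yield (ts, payload)
        last = ts


def strip_keepalives(stream):
    """Receiver side: drop the filler entries before handing data to the merge."""
    return [(ts, p) for ts, p in stream if p is not KEEPALIVE]


# A shard that sends one doc, then nothing for 250 s (simulated).
events = [(0, "c!00000"), (250, "c!00001")]
padded = list(with_keepalives(events, interval=100))
print([ts for ts, _ in padded])
```

The receiving side has to recognise and discard the filler, which is part of what makes this approach feel hacky in practice.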