[
https://issues.apache.org/jira/browse/SOLR-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Patson Luk updated SOLR-17143:
------------------------------
Summary: Streaming with multiple shards can trigger unexpected IdleTimeout
(was: Streaming with multiple shards can triggered unexpected IdleTimeout)
> Streaming with multiple shards can trigger unexpected IdleTimeout
> -----------------------------------------------------------------
>
> Key: SOLR-17143
> URL: https://issues.apache.org/jira/browse/SOLR-17143
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 9.4.1
> Reporter: Patson Luk
> Priority: Critical
>
> With the new [test case
> submitted|https://github.com/cowpaths/fullstory-solr/commit/383134928e372f19d96b1b16459a3566169d3ff4],
> we reproduced a streaming issue from our production cloud
> environment.
> The test case uses a collection of 2 shards into which 20k docs are
> indexed. 10k docs have an id with routing prefix `a`, and the other 10k
> have prefix `c`. Each prefix hashes to a different shard, producing 2
> shards of 10k docs each.
> Now, if we stream sorted by id, both shards send back some data
> initially. However, only one shard (the one hosting prefix `a`) sees
> continued traffic because of the sorted iteration. The other shard
> eventually throws {{IdleTimeout}} because its stream is pending without
> network activity.
> If we change `SHARD_COUNT` in the test case from 2 to 1, the test runs
> fine.
> In our environment, the Jetty HTTP connector timeout is 120 secs, yet we
> still run into this occasionally. The client does consume the data at a
> reasonable rate. However, with up to 1024 shards per collection, it is
> quite easy for some shards to have no data streamed within 120 secs,
> which triggers the timeout mentioned above.
> We assume this kind of streaming issue is not uncommon in distributed
> systems, and are wondering what could be done to fix or mitigate it.
> We have several ideas:
> 1. If possible, we might want to stream per shard instead of per collection.
> However, there are cases where we do want to stream over the whole
> collection with sorted ordering.
> 2. Is there any low-level "keep-alive" already built in? I couldn't
> find one so far :)
> 3. Keep the stream alive by pushing a small amount of dummy data from the
> aggregator (the Solr node that distributes the stream request as /export to
> other nodes). However, this got very hacky and is still not working. I
> didn't dig too deep, as I wish to surface this issue to the Solr community
> and gather some thoughts first!