[
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585100#comment-14585100
]
Erick Erickson edited comment on SOLR-7571 at 6/14/15 3:46 PM:
---------------------------------------------------------------
[[email protected]]
bq: Providing this via JMX, which does not stand between client and server on
every request, and is checked independently of search requests, maaaay in some
ways be better
If we do it that way, how do you handle a many-shard situation? It seems the
client would have to ping all the nodes it might care about. I was thinking
of a couple of optional params. The issue here is that _some_ node is running
hot; I don't particularly care which. So all that's necessary to return with
each request is the high water mark, thus adding only a single return value
per metric.
Actually I'm thinking of something like this:
metrics=true|false, default false. If false, we do what we do now.
metrics.detail=true|false, default false. If false, return only the highest
value of each metric from any replica; otherwise return all the metrics from
all the replicas.
metrics.which=comma-delimited-list. The list of things we want to return; the
one I'm thinking of as a PoC is "threadcount".
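To make the three params concrete, here is a rough sketch of how the returned
values might be assembled from per-replica readings. This is illustration
only, not Solr code: the function name summarize_metrics, the replica names,
and the dict shapes are all assumptions.

```python
from typing import Dict, List

# Hypothetical shape: replica name -> {metric name: current value}.
ReplicaMetrics = Dict[str, Dict[str, float]]


def summarize_metrics(per_replica: ReplicaMetrics,
                      which: List[str],
                      metrics: bool = False,
                      detail: bool = False) -> dict:
    """Build the metrics payload described by metrics, metrics.detail
    and metrics.which (all names here are illustrative)."""
    if not metrics:
        # metrics=false: behave as today and return nothing extra.
        return {}
    if detail:
        # metrics.detail=true: every requested metric from every replica.
        return {
            replica: {name: values[name] for name in which if name in values}
            for replica, values in per_replica.items()
        }
    # metrics.detail=false: only the high water mark of each metric.
    high_water: Dict[str, float] = {}
    for values in per_replica.values():
        for name in which:
            if name in values:
                high_water[name] = max(high_water.get(name, values[name]),
                                       values[name])
    return high_water
```

With the defaults, a request asking for "threadcount" would get back a single
number per metric no matter how many replicas were involved, which is the
point of the high water mark.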
> Return metrics with update requests to allow clients to self-throttle
> ---------------------------------------------------------------------
>
> Key: SOLR-7571
> URL: https://issues.apache.org/jira/browse/SOLR-7571
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.10.3
> Reporter: Erick Erickson
> Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it; anyone who wants to,
> please feel free to take it.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client
> (and post.jar for JSON files, for that matter) firehoses updates (150
> separate threads in total) at Solr. Eventually, replicas (not leaders) go
> into recovery, the state cascades, and eventually the entire cluster becomes
> unusable. SOLR-5850 delays the behavior, but it still occurs. There are no
> errors in the followers' logs; this is leader-initiated recovery because of
> a timeout.
> I think the root problem is that the client is just sending too many requests
> to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to
> distribute update requests to all the followers) (this was observed in Solr
> 4.10.3+). I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index
> that fast". This is unsatisfactory: since "that fast" is variable, the only
> recourse is to set that threshold so low that the Solr cluster isn't being
> driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The
> number of outstanding update threads is one possibility. The client could
> then slow down sending updates to Solr.
> I'm not sure there's a good way to deal with this on the server. Once the
> timeout is encountered, you don't know whether the doc has actually been
> indexed on the follower (actually, in this case it _is_ indexed, it just
> takes a while). Ideally we'd just manage it all magically, but an
> alternative that lets clients dynamically throttle themselves seems doable.
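As a rough sketch of the client side of the idea in the description, an
indexing client could adjust the pause between update batches based on the
thread count reported back with each response. Everything here is an
assumption for illustration: the function name, the threshold of 200 threads,
and the step and cap values are made up, and no such SolrJ API exists.

```python
def next_delay_ms(current_delay_ms: int,
                  reported_threadcount: int,
                  threshold: int = 200,      # assumed "running hot" level
                  step_ms: int = 50,         # assumed smallest adjustment
                  max_delay_ms: int = 5000   # assumed upper bound on the pause
                  ) -> int:
    """Return how long the client should pause before its next update."""
    if reported_threadcount > threshold:
        # Cluster is running hot: back off quickly by doubling the pause,
        # starting from step_ms and never exceeding max_delay_ms.
        return min(max_delay_ms, max(step_ms, current_delay_ms * 2))
    # Cluster has headroom: speed up gradually, never going below zero.
    return max(0, current_delay_ms - step_ms)
```

Backing off fast and recovering slowly is one possible policy; the right
threshold would have to come from observing the cluster, since "that fast"
varies from one setup to another.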