[
https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013047#comment-16013047
]
Tomás Fernández Löbbe commented on SOLR-9824:
---------------------------------------------
I've seen tests hang too, mostly in {{waitForEmptyQueue()}}. Solr gets into a
state where the queue has one element, there is one runner, and the
scheduler is shut down. I have only seen this when the test is shutting down.
I added this change inside that loop:
{code:java}
if (scheduler.isTerminated()) {
  log.warn("The task queue still has elements but the update scheduler {} is terminated. "
      + "Can't process any more tasks. "
      + "Queue size: {}, Runners: {}. Current thread Interrupted? {}",
      scheduler, queue.size(), runners.size(), threadInterrupted);
  break;
}
{code}
and I have seen no more hangs in my tests (after 300+ runs). I checked the
thread dumps when this happens and there are no CUSC.Runner instances running.
I believe the problem is that in {{addRunner}} we should remove the Runner from
the runners list if {{scheduler.execute(r)}} fails. Something like:
{code:java}
Runner r = new Runner();
runners.add(r);
try {
  // this can throw an exception if the scheduler has been shut down, but that should be fine.
  scheduler.execute(r);
} catch (RuntimeException e) {
  runners.remove(r);
  throw e;
}
{code}
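To make the failure mode easier to reproduce outside of Solr, here is a minimal standalone sketch of the same rollback idea. It is not the actual CUSC code: {{RunnerRegistry}}, {{runnerCount()}} and the use of a plain {{ExecutorService}} are names I made up for illustration, and runners here do not deregister themselves on completion the way a real Runner would.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for the runner bookkeeping; not the real Solr class.
class RunnerRegistry {
  private final Queue<Runnable> runners = new ConcurrentLinkedQueue<>();
  private final ExecutorService scheduler;

  RunnerRegistry(ExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  // Registers the runner, then rolls the registration back if the
  // scheduler refuses the task (for example because it was shut down).
  void addRunner(Runnable r) {
    runners.add(r);
    try {
      scheduler.execute(r);
    } catch (RuntimeException e) {
      // Without this removal, a waiter would keep seeing a runner that never ran.
      runners.remove(r);
      throw e;
    }
  }

  int runnerCount() {
    return runners.size();
  }

  public static void main(String[] args) {
    ExecutorService scheduler = Executors.newSingleThreadExecutor();
    scheduler.shutdown(); // simulate the test tearing down

    RunnerRegistry registry = new RunnerRegistry(scheduler);
    try {
      registry.addRunner(() -> { });
    } catch (RejectedExecutionException expected) {
      System.out.println("execute rejected, runners left: " + registry.runnerCount());
    }
  }
}
```

With the {{runners.remove(r)}} line deleted, the registry would still report one runner after the rejected submit, which matches the "one element, one runner, scheduler shut down" state I saw in the hanging tests.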
> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
> Key: SOLR-9824
> URL: https://issues.apache.org/jira/browse/SOLR-9824
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 6.3
> Reporter: David Smiley
> Assignee: Mark Miller
> Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch,
> SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch
>
>
> This takes a while to explain; bear with me. While working on bulk indexing
> small documents, I looked at the logs of my SolrCloud nodes. I noticed that
> shards would see an /update log message every ~6ms which is *way* too much.
> These are requests from one shard (that isn't a leader/replica for these docs
> but the recipient from my client) to the target shard leader (no additional
> replicas). One might ask why I'm not sending docs to the right shard in the
> first place; I have a reason but it's beside the point -- there's a real
> Solr perf problem here and this probably applies equally to
> replicationFactor>1 situations too. I could turn off the logs but that would
> hide useful stuff, and it's disconcerting to me that so many short-lived HTTP
> requests are happening, somehow at the behest of DistributedUpdateProcessor.
> After lots of analysis and debugging and hair pulling, I finally figured it
> out.
> SOLR-7333 ([~tpot]) introduced an optimization called
> {{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will
> poll with a '0' timeout to the internal queue, so that it can close the
> connection without it hanging around any longer than needed. This part makes
> sense to me. Currently the only spot that has the smarts to set this flag is
> {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the
> last document. So if a shard received docs in a javabin stream (but not
> other formats) one would expect the _last_ document to have this flag.
> There's even a test. Docs without this flag get the default poll time; for
> javabin it's 25ms. Okay.
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send
> javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.
> I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW
> DistributedUpdateProcessor uses CUSC too). CUSC uses the RequestWriter
> (defaulting to javabin) to send each document separately without any leading
> marker or trailing marker. For the XML format by comparison, there is a
> leading and trailing marker (<stream> ... </stream>). Since there's no outer
> container for the javabin unmarshalling to detect the last document, it marks
> _every_ document as {{req.lastDocInBatch()}}! Ouch!