[jira] [Commented] (SOLR-9824) Documents indexed in bulk are replicated using too many HTTP requests

David Smiley (JIRA) Thu, 08 Dec 2016 22:02:30 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734440#comment-15734440
 ]


David Smiley commented on SOLR-9824:
------------------------------------

I like the idea of removing lastDoc; it's hard to work with.

I'm not sure why we even need a longLastingThreads flag.  Can't we just ensure 
that blockUntilFinished() triggers the runners to interrupt their poll's 
always?  You're doing this now in the patch (albeit only when 
longLastingThreads==true) but why not simply always?  And why are Future's 
needed to do the cancel() when we can interrupt the Threads directly? We could 
expose them easily by having the Runner store it's current thread when run() is 
called.  If we didn't need a "longLastingThreads" boolean then the client could 
set the poll time independently, perhaps defaulting to '0' but might set it to 
be very long if it intends to call blockUntilFinished().  Arguably, 
blockUntilFinished() might log a warning if the poll time is zero because it 
would amount to misconfiguration.

It may be safer to ensure that interrupt() *only* affects the queue.poll calls 
and not anything else; but I'm unsure if anything else internal to writing the 
document to the stream would interrupt/cancel part way to warrant caring.  Do 
you know?  It's do-able but would require some extra boolean volatile state 
variables like doStop and currentlyPolling.

I'm confused about something: the first line of sendUpdateStream() is 
{{while(!queue.isEmpty)}} but why even do that given that given we poll the 
queue?  i.e. why not {{while(true)}}?  Or perhaps why even loop at all given 
the caller has a similar loop.

> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-9824
>                 URL: https://issues.apache.org/jira/browse/SOLR-9824
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.3
>            Reporter: David Smiley
>         Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch
>
>
> This takes awhile to explain; bear with me. While working on bulk indexing 
> small documents, I looked at the logs of my SolrCloud nodes.  I noticed that 
> shards would see an /update log message every ~6ms which is *way* too much.  
> These are requests from one shard (that isn't a leader/replica for these docs 
> but the recipient from my client) to the target shard leader (no additional 
> replicas).  One might ask why I'm not sending docs to the right shard in the 
> first place; I have a reason but it's besides the point -- there's a real 
> Solr perf problem here and this probably applies equally to 
> replicationFactor>1 situations too.  I could turn off the logs but that would 
> hide useful stuff, and it's disconcerting to me that so many short-lived HTTP 
> requests are happening, somehow at the bequest of DistributedUpdateProcessor. 
>  After lots of analysis and debugging and hair pulling, I finally figured it 
> out.  
> In SOLR-7333 ([~tpot]) introduced an optimization called 
> {{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will 
> poll with a '0' timeout to the internal queue, so that it can close the 
> connection without it hanging around any longer than needed.  This part makes 
> sense to me.  Currently the only spot that has the smarts to set this flag is 
> {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the 
> last document.  So if a shard received docs in a javabin stream (but not 
> other formats) one would expect the _last_ document to have this flag.  
> There's even a test.  Docs without this flag get the default poll time; for 
> javabin it's 25ms.  Okay.
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send 
> javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.  
> I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW 
> DistributedUpdateProcessor uses CUSC too).  CUSC uses the RequestWriter 
> (defaulting to javabin) to send each document separately without any leading 
> marker or trailing marker.  For the XML format by comparison, there is a 
> leading and trailing marker (<stream> ... </stream>).  Since there's no outer 
> container for the javabin unmarshalling to detect the last document, it marks 
> _every_ document as {{req.lastDocInBatch()}}!  Ouch!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-9824) Documents indexed in bulk are replicated using too many HTTP requests

Reply via email to