David Smiley created SOLR-9824:
----------------------------------

             Summary: Documents indexed in bulk are replicated using too many 
HTTP requests
                 Key: SOLR-9824
                 URL: https://issues.apache.org/jira/browse/SOLR-9824
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 6.3
            Reporter: David Smiley


This takes awhile to explain; bear with me. While working on bulk indexing 
small documents, I looked at the logs of my SolrCloud nodes.  I noticed that 
shards would see an /update log message every ~6ms which is *way* too much.  
These are requests from one shard (that isn't a leader/replica for these docs 
but the recipient from my client) to the target shard leader (no additional 
replicas).  One might ask why I'm not sending docs to the right shard in the 
first place; I have a reason but it's besides the point -- there's a real Solr 
perf problem here and this probably applies equally to replicationFactor>1 
situations too.  I could turn off the logs but that would hide useful stuff, 
and it's disconcerting to me that so many short-lived HTTP requests are 
happening, somehow at the bequest of DistributedUpdateProcessor.  After lots of 
analysis and debugging and hair pulling, I finally figured it out.  

In SOLR-7333 ([~tpot]) introduced an optimization called 
{{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will 
poll with a '0' timeout to the internal queue, so that it can close the 
connection without it hanging around any longer than needed.  This part makes 
sense to me.  Currently the only spot that has the smarts to set this flag is 
{{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the last 
document.  So if a shard received docs in a javabin stream (but not other 
formats) one would expect the _last_ document to have this flag.  There's even 
a test.  Docs without this flag get the default poll time; for javabin it's 
25ms.  Okay.

I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send 
javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.  I 
didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW 
DistributedUpdateProcessor uses CUSC too).  CUSC uses the RequestWriter 
(defaulting to javabin) to send each document separately without any leading 
marker or trailing marker.  For the XML format by comparison, there is a 
leading and trailing marker (<stream> ... </stream>).  Since there's no outer 
container for the javabin unmarshalling to detect the last document, it marks 
_every_ document as {{req.lastDocInBatch()}}!  Ouch!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to