[ https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382086#comment-14382086 ]

Timothy Potter commented on SOLR-6816:
--------------------------------------

Coming back to this discussion ...

I still think there is a need for a new optional parameter on an UpdateRequest 
that specifies the request is a bulk add and the client application knows all 
the docs in the request are either exact duplicates or new docs. You would use 
this parameter for high-volume indexing jobs, such as from Hadoop, Spark, or 
log indexing applications. When this parameter is set to true (default is 
false, of course), we can skip the version lookup on replicas in the 
{{versionAdd}} method of the {{DistributedUpdateProcessor}}, i.e.:

{code}
boolean bulkAdds = cmd.getReq().getParams().getBool(UpdateRequest.BULK_ADD, false);
if (!bulkAdds) {
  Long lastVersion = vinfo.lookupVersion(cmd.getIndexedId());
  if (lastVersion != null && Math.abs(lastVersion) >= versionOnUpdate) {
    // This update is a repeat, or was reordered. We need to drop this update.
    return true;
  }
}
{code}
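
For clarity, here's roughly what opting in would look like from SolrJ. This is 
just a sketch of the proposal: the {{bulkAdds}} param name and the 
{{UpdateRequest.BULK_ADD}} constant don't exist anywhere yet, and the ZK host 
and collection names are placeholders.

{code}
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class BulkAddExample {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("localhost:2181");
    client.setDefaultCollection("collection1");

    UpdateRequest req = new UpdateRequest();
    // Proposed (hypothetical) param: asserts every doc in this request is
    // either brand new or an exact duplicate of an existing doc.
    req.setParam("bulkAdds", "true");

    SolrInputDocument doc = new SolrInputDocument();
    doc.setField("id", "doc-1");
    doc.setField("text_t", "~1k of content here ...");
    req.add(doc);

    // Replicas would skip the version lookup in versionAdd for this request.
    req.process(client);
    client.close();
  }
}
{code}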

I didn't think the {{lookupVersion}} would be that much of an overhead, but my 
testing shows that it is, even when using docValues for the {{_version_}} field.

Using this bulk add parameter, I'm seeing very good improvements when using 
replication. Specifically, here are the results I'm getting by making this 
simple change:

Indexing 9,992,262 docs (~1k each) in a 3-shard collection with RF=2 (I'm 
using 6 r3.xlarge instances in EC2, so there is no contention between nodes, 
i.e. all replicas are on different servers):

* baseline branch5x: 758 seconds, ~13,182 docs per second
* branch5x with the fix for SOLR-6820 (65536 version buckets): 710 seconds, ~14,074 docs per second
* branch5x with the fix for SOLR-6820 and this bulkAdds parameter: 485 seconds, ~20,603 docs per second

That's a 56% increase in throughput over the baseline in branch5x 
(20,603 / 13,182 ≈ 1.56)! What's more, 20,603 docs per second is nearing the 
throughput I was getting in the baseline without replication (23,401 docs per 
second).

I don't think using {{overwrite=false}} will work here, though, because most 
apps still want basic duplicate checking on the leader to catch documents that 
get re-sent to Solr. For instance, imagine a Map/Reduce job that indexes into 
Solr: if a task fails, Hadoop usually retries that task a couple of times, 
meaning all docs in the failed block will be sent again. If we use 
{{overwrite=false}}, you'll end up with dupes in your index. This is why I 
think an additional parameter, letting client apps tell Solr they are doing 
bulk adds of new docs, is required.
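
To make the retry argument concrete, here's a toy illustration in plain Java 
(not Solr code) of why a blind append duplicates docs when a task is retried, 
while a keyed add stays idempotent:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetryDupes {
  public static void main(String[] args) {
    List<String> appendOnly = new ArrayList<String>();         // ~ overwrite=false
    Map<String, String> keyed = new HashMap<String, String>(); // ~ leader dedup

    String[] batch = {"doc-1", "doc-2"};
    for (int attempt = 0; attempt < 2; attempt++) { // failed task retried once
      for (String id : batch) {
        appendOnly.add(id); // blind append: the retry duplicates both docs
        keyed.put(id, id);  // keyed on uniqueKey: the retry is a no-op
      }
    }
    System.out.println("overwrite=false: " + appendOnly.size() + " docs"); // 4
    System.out.println("deduped:         " + keyed.size() + " docs");      // 2
  }
}
{code}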

Lastly, I'm still working on a way to send fewer requests from leader to 
replica when using batches. Just increasing the poll queue time for CUSS in 
StreamingSolrClients imposes an unnecessary wait after the last doc in the 
batched request is processed, so I'm trying to devise a way for the entire 
batch of docs to be streamed to the replica without that trailing wait.
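
To illustrate the problem, here's a toy model of the polling loop (not the 
actual CUSS code); the {{lastInBatch}} flag is just one hypothetical way for 
the client to signal the end of a batch so the runner doesn't sit out the poll 
timeout after the final doc:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollQueueSketch {

  static class Doc {
    final String id;
    final boolean lastInBatch; // hypothetical end-of-batch marker
    Doc(String id, boolean lastInBatch) {
      this.id = id;
      this.lastInBatch = lastInBatch;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    final long pollQueueTimeMs = 250; // wait for more docs to stream together
    BlockingQueue<Doc> queue = new LinkedBlockingQueue<Doc>();
    queue.add(new Doc("doc-1", false));
    queue.add(new Doc("doc-2", true));

    while (true) {
      Doc doc = queue.poll(pollQueueTimeMs, TimeUnit.MILLISECONDS);
      if (doc == null) break;     // idle timeout: close the stream to the replica
      System.out.println("streaming " + doc.id);
      if (doc.lastInBatch) break; // no pointless wait after the final doc
    }
  }
}
{code}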

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and 
> low hanging fruit. We need to vet the performance and try to address any 
> holes.
> Note: A common report is that adding any replication is very slow.


