[
https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382086#comment-14382086
]
Timothy Potter commented on SOLR-6816:
--------------------------------------
Coming back to this discussion ...
I still think there is a need for a new optional parameter on an UpdateRequest
that specifies that the request is a bulk add and that the client application
knows all the docs in the request are either exact duplicates or new docs. You
would use this parameter for high-volume indexing jobs such as from Hadoop,
Spark, or log indexing applications. When this parameter is set to true (the
default is false, of course), we can skip the version lookup on replicas in the
{{versionAdd}} method of the {{DistributedUpdateProcessor}}, i.e.:
{code}
boolean bulkAdds =
    cmd.getReq().getParams().getBool(UpdateRequest.BULK_ADD, false);
if (!bulkAdds) {
  Long lastVersion = vinfo.lookupVersion(cmd.getIndexedId());
  if (lastVersion != null && Math.abs(lastVersion) >= versionOnUpdate) {
    // This update is a repeat, or was reordered. We need to drop this update.
    return true;
  }
}
{code}
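As a rough illustration of what the flag changes, here is a self-contained toy model of the check in the snippet above. It is a sketch only: {{BulkAddSketch}}, {{shouldDrop}} and the in-memory map are hypothetical stand-ins and not Solr classes; the map merely plays the role of {{vinfo.lookupVersion}} so that the number of lookups can be counted.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the replica-side version check; these are NOT Solr's classes. */
class BulkAddSketch {
    // Hypothetical stand-in for the version info: id -> last applied version.
    private final Map<String, Long> versions = new HashMap<>();

    // Counts how many version lookups were performed.
    int lookups = 0;

    Long lookupVersion(String id) {
        lookups++;
        return versions.get(id);
    }

    /** Returns true if the update should be dropped (repeat or reordered). */
    boolean shouldDrop(String id, long versionOnUpdate, boolean bulkAdds) {
        if (!bulkAdds) {
            Long lastVersion = lookupVersion(id);
            if (lastVersion != null && Math.abs(lastVersion) >= versionOnUpdate) {
                return true;
            }
        }
        // The update is applied; remember its version.
        versions.put(id, versionOnUpdate);
        return false;
    }

    public static void main(String[] args) {
        BulkAddSketch normal = new BulkAddSketch();
        normal.shouldDrop("doc1", 100L, false); // first delivery: applied
        normal.shouldDrop("doc1", 100L, false); // repeat: dropped
        System.out.println("lookups with bulkAdds=false: " + normal.lookups);

        BulkAddSketch bulk = new BulkAddSketch();
        bulk.shouldDrop("doc1", 100L, true);
        bulk.shouldDrop("doc1", 100L, true);
        System.out.println("lookups with bulkAdds=true: " + bulk.lookups);
    }
}
```

In this model the lookup runs once per document when the flag is false and never when it is true, which is the per-document work the proposed parameter would remove on replicas.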
I didn't think the {{lookupVersion}} would be that much of an overhead, but my
testing shows that it is, even when using docValues for the {{_version_}} field.
Using this bulk add parameter, I'm seeing very good improvements when using
replication. Specifically, here are the results I'm getting by making this
simple change:
Indexing 9,992,262 docs (~1k in size) in a 3-shard collection with RF=2 (I'm
using 6 r3.xlarge instances in EC2 so there is no contention between nodes,
i.e. all replicas are on different servers):
* baseline branch5x: 758 seconds, ~13,182 docs per second
* branch5x with fix for SOLR-6820 (65536 version buckets): 710 seconds, ~14,074
docs per second
* branch5x with fix for SOLR-6820 and this bulkAdds parameter: 485 seconds,
~20,603 docs per second
That's a 56% increase in throughput over the baseline in branch5x! What's
more, the 20,603 docs per second is close to the throughput I was getting in
the baseline without replication (23,401 docs per second).
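The figures above can be re-derived from the doc count and the elapsed times. The following is a small sketch of that arithmetic; the class and method names are made up for illustration, and the inputs are the numbers quoted in this comment.

```java
/** Recomputes the throughput figures quoted above; names are illustrative only. */
class ThroughputCheck {
    static double docsPerSecond(long docs, long seconds) {
        return (double) docs / seconds;
    }

    static double percentIncrease(double baseline, double improved) {
        return (improved - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        long docs = 9_992_262L;
        double baseline = docsPerSecond(docs, 758);  // baseline branch5x
        double buckets = docsPerSecond(docs, 710);   // with SOLR-6820 fix
        double bulkAdds = docsPerSecond(docs, 485);  // SOLR-6820 fix + bulkAdds

        System.out.printf("baseline:  %.0f docs/s%n", baseline);
        System.out.printf("SOLR-6820: %.0f docs/s%n", buckets);
        System.out.printf("bulkAdds:  %.0f docs/s%n", bulkAdds);
        System.out.printf("increase over baseline: %.1f%%%n",
                percentIncrease(baseline, bulkAdds));
    }
}
```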
I don't think using {{overwrite=false}} will work here though because most apps
still want basic duplicate checking on the leader to catch duplicate documents
that get resent to Solr. For instance, imagine a Map/Reduce job that indexes
into Solr ... if a task fails, then Hadoop usually re-tries that task a couple
of times, meaning all docs in the block that failed will be sent again. If we
use {{overwrite=false}}, then you'll end up with dupes in your index. This is
why I think having an additional parameter that lets client apps tell Solr they
are doing bulk adds of new docs is required.
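The retry scenario can be sketched with a toy index. This is a simplified model, not Solr code: a map keyed by document id stands in for the default overwrite behaviour, and a plain list stands in for {{overwrite=false}}, where docs are appended without an id check. The class name and the doc ids are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a retried block of docs; NOT Solr code. */
class RetryDuplicates {
    /** Default overwrite behaviour: at most one doc per unique id. */
    static int indexWithOverwrite(List<String> sentIds) {
        Map<String, String> index = new HashMap<>();
        for (String id : sentIds) {
            index.put(id, "doc-" + id); // a resent doc replaces the earlier copy
        }
        return index.size();
    }

    /** overwrite=false: every doc is appended, with no check on the id. */
    static int indexWithoutOverwrite(List<String> sentIds) {
        List<String> index = new ArrayList<>();
        for (String id : sentIds) {
            index.add("doc-" + id); // a resent doc becomes a second copy
        }
        return index.size();
    }

    public static void main(String[] args) {
        List<String> block = Arrays.asList("1", "2", "3");

        // The task fails after sending its block, so the retry sends it again.
        List<String> sent = new ArrayList<>(block);
        sent.addAll(block);

        System.out.println("with overwrite:  " + indexWithOverwrite(sent) + " docs");
        System.out.println("overwrite=false: " + indexWithoutOverwrite(sent) + " docs");
    }
}
```

With the block of three docs sent twice, the keyed index still holds three docs while the append-only one holds six, which is the duplicate problem described above.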
Lastly, I'm still working on a way to send fewer requests from leader to replica
when using batches. Just increasing the poll queue time for CUSS in
StreamingSolrClients imposes an unnecessary wait after the last doc in the
batched request is processed. So I'm trying to devise a way for the entire
batch of docs to be streamed to the replica without having this unnecessary
wait after the last doc.
> Review SolrCloud Indexing Performance.
> --------------------------------------
>
> Key: SOLR-6816
> URL: https://issues.apache.org/jira/browse/SOLR-6816
> Project: Solr
> Issue Type: Task
> Components: SolrCloud
> Reporter: Mark Miller
> Priority: Critical
> Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and
> low hanging fruit. We need to vet the performance and try to address any
> holes.
> Note: A common report is that adding any replication is very slow.