On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt <[email protected]> wrote:
> Hi guys,
>
> First of all, it seems that I have used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current indexing
> model of Solr from the sync way it is now (the user submits an update
> request and waits for the response). What about changing it to an async
> model, where nodes will only persist the update into the tlog and then
> return immediately, much like what the tlog is doing now. Then we have a
> dedicated executor which reads from the tlog to do the indexing (a
> producer-consumer model with the tlog acting as the queue).

The biggest problem I have with this is that the client doesn't learn about indexing problems without awkward callbacks later to see if something went wrong, even for simple stuff like a schema problem (e.g. an undefined field). It's a useful *option*, anyway.

> I do see several big benefits of this approach:
>
> - We can batch updates in a single call. Right now we do not use the
> writer.add(documents) API from Lucene; batching updates would boost
> the performance of indexing.

I'm a bit skeptical that this would boost indexing performance. Please understand that the intent of that API is transactionality (an atomic add) and ensuring all docs go into the same segment. Solr *does* use that API for nested / parent-child documents, because it has to. If that API were called for normal docs, I could see the configured indexing-buffer RAM or doc limit being exceeded substantially. Perhaps not a big deal. You could test your performance theory on a hacked Solr without many modifications, I think? Just buffer, then send in bulk.

> - One common problem with Solr now is that we have a lot of threads doing
> indexing, so we can end up with many small segments. Using this model we
> can have bigger segments, and thus less merge cost.

This is app/use-case dependent, of course. If you observe the segment count to be high, I think it's more likely due to a sub-optimal commit strategy. Many users should put more thought into this.
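To illustrate the "just buffer, then send in bulk" experiment I mentioned above, here's a toy client-side sketch in plain Java. `BulkBuffer` and its `sender` hook are hypothetical names of mine, not Solr or SolrJ APIs; in practice the sender would wrap something like a bulk `SolrClient` add call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy client-side buffer: collect docs, hand them off in one bulk send
 *  once a threshold is hit, instead of one request per doc. */
public class BulkBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender; // hypothetical hook; wraps the real bulk send
    private final List<T> pending = new ArrayList<>();

    public BulkBuffer(int batchSize, Consumer<List<T>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    public void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending as one batch; call once more at the end. */
    public void flush() {
        if (!pending.isEmpty()) {
            sender.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }
}
```

This only shows the client-side batching shape; it says nothing about whether Lucene would actually index the resulting batches faster, which is the part I'd want measured.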
If update requests to Solr contain a small number of docs and explicitly issue a commit (commitWithin, on the other hand, is fine), then that is a recipe for small segments and is generally expensive as well (commits are kinda expensive). Many apps would do well to either pass commitWithin or rely on a configured commitWithin, accomplishing the same thing without an explicit commit. Apps that can't do that (e.g. because they need to immediately read-your-write, or for other reasons where I work) can't use async anyway.

An idea I've had to help throughput for this case: commits that are about to happen concurrently with other indexing could voluntarily wait a short period of time (500ms?) in an attempt to coalesce the commit needs of both concurrent indexing requests. Very opt-in, probably a solrconfig value, and probably a wait time in the 100s-of-milliseconds range. An ideal implementation would be conservative and skip the wait when there is no concurrent indexing request that both started before the current request and requires a commit as well.

If your goal is fewer segments, then definitely check out the recent improvements in Lucene that do some lightweight merge-on-commit. The Solr-side hook is SOLR-14582 <https://issues.apache.org/jira/browse/SOLR-14582>, and it requires a custom MergePolicy. I contributed such a MergePolicy here: https://github.com/apache/lucene-solr/pull/1222, although it needs to be updated in light of Lucene changes that have occurred since then. We've been using that MergePolicy at Salesforce for a couple of years, and it has cut our segment count in half! Of course, if you can generate fewer segments in the first place, that's preferable and is more to your point.

> - Another huge reason here is that after switching to this model, we can
> remove the tlog and use a distributed queue like Kafka or Pulsar.
> Since the purpose of the leader in SolrCloud now is ordering updates, and
> the distributed queue is already ordering updates for us, there is no need
> to have a dedicated leader. That is just the beginning of the things we
> can do after using a distributed queue.

I think that's the real thrust of your motivations, and it sounds good to me! Also, please see https://issues.apache.org/jira/browse/SOLR-14778 about making the optionality of the updateLog a better-supported option in SolrCloud.

> What do you guys think about this? Just want to hear from you guys before
> going deep into this rabbit hole.
>
> Thanks!
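For discussion's sake, the proposed model (acknowledge as soon as the update is persisted to the queue; a dedicated consumer drains and indexes in batches) can be sketched in plain Java. This is a toy: `AsyncIndexer` and its method names are mine, an in-memory queue stands in for the tlog or a Kafka/Pulsar topic, and "indexing" just records the batch where a real consumer would call something like writer.addDocuments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy sketch of the async indexing model: update() returns as soon as the
 *  doc is enqueued; a dedicated consumer thread drains batches and "indexes"
 *  them. All names here are hypothetical, not Solr code. */
public class AsyncIndexer {
    private final BlockingQueue<String> tlog = new LinkedBlockingQueue<>();
    private final List<List<String>> indexedBatches = new ArrayList<>();
    private final Thread consumer;
    private volatile boolean running = true;

    public AsyncIndexer() {
        consumer = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            while (running || !tlog.isEmpty()) {
                batch.clear();
                tlog.drainTo(batch, 100); // batch whatever has accumulated, up to 100
                if (batch.isEmpty()) {
                    try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                    continue;
                }
                // a real consumer would index the batch here, e.g. in one bulk call
                synchronized (indexedBatches) {
                    indexedBatches.add(new ArrayList<>(batch));
                }
            }
        });
        consumer.start();
    }

    /** Returns immediately after persisting to the queue, per the proposal. */
    public void update(String doc) { tlog.add(doc); }

    /** Stops the consumer after it has drained everything. */
    public List<List<String>> shutdown() throws InterruptedException {
        running = false;
        consumer.join();
        return indexedBatches;
    }
}
```

Note how the client-visible problem I raised falls out of this shape: `update` has already returned by the time the batch is processed, so any per-doc failure has to surface through some later callback or status check.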
