I think this model has a lot of potential.

I'd like to add another wrinkle to this: store the information about each
batch as a record in the index. Each batch record would contain a
fingerprint for the batch. This solves several problems and allows us to
confirm the integrity of the batch. It also means that we can compare
indexes by comparing the batch fingerprints rather than building a
fingerprint from the entire index.
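A minimal sketch of that comparison idea, assuming each batch is fingerprinted with an ordinary digest (class and method names here are illustrative, not Solr APIs):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Illustrative sketch: each batch record stores a fingerprint, and two
// indexes are compared batch-by-batch instead of hashing the whole index.
public class BatchFingerprints {

    // Fingerprint one batch of document ids (SHA-256 over the id stream).
    static String fingerprint(List<String> batchDocIds) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String id : batchDocIds) {
                md.update(id.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ["ab","c"] != ["a","bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Two indexes match if their ordered batch fingerprints match; on a
    // mismatch only the differing batches need to be inspected or re-synced.
    static boolean sameIndex(List<String> fps1, List<String> fps2) {
        return fps1.equals(fps2);
    }
}
```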


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> I suppose failures would be returned to the client on the async response?
>
> How would one keep the tlog from growing forever if the actual indexing
> took a long time?
>
> I'm guessing that this would be optional...
>
> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <ichattopadhy...@gmail.com>
> wrote:
>
>> Can there be a situation where the index writer fails after the document
>> was added to the tlog and a success has already been sent to the user? I
>> think we want to avoid such a situation, don't we?
>>
>> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt, <da...@apache.org> wrote:
>>
>>> > Can you explain a little more on how this would impact durability of
>>> updates?
>>> Since we persist updates into the tlog, I do not think this will be an
>>> issue.
>>>
>>> > What does a failure look like, and how does that information get
>>> propagated back to the client app?
>>> I have not been able to do much research yet, but I think this will work
>>> the same way as our current asyncId. In this case the asyncId will be the
>>> version of an update (in the case of a distributed queue it will be the
>>> offset). Failed updates will be put into a time-to-live map so users can
>>> query the failure; for successes we can skip that by leveraging the max
>>> succeeded version so far.
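A rough sketch of that status-tracking scheme, assuming a time-to-live map keyed by version plus a running max succeeded version (all names here are hypothetical, not existing Solr code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker: failures are kept in a TTL map keyed by update
// version; successes are summarized by the max succeeded version so far.
public class UpdateStatusTracker {
    static final long TTL_MS = 60_000; // keep failures queryable for 1 minute

    record Failure(String reason, long expiresAtMs) {}

    final Map<Long, Failure> failures = new ConcurrentHashMap<>();
    final AtomicLong maxSucceededVersion = new AtomicLong(-1);

    void recordSuccess(long version) {
        maxSucceededVersion.accumulateAndGet(version, Math::max);
    }

    void recordFailure(long version, String reason) {
        failures.put(version, new Failure(reason, System.currentTimeMillis() + TTL_MS));
    }

    // Client polls with the version (or queue offset) returned by the
    // async submit; expired failure entries are lazily evicted.
    String query(long version) {
        Failure f = failures.get(version);
        if (f != null) {
            if (System.currentTimeMillis() > f.expiresAtMs) {
                failures.remove(version);
            } else {
                return "failed: " + f.reason;
            }
        }
        return version <= maxSucceededVersion.get() ? "succeeded" : "pending";
    }
}
```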
>>>
>>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob <md...@apache.org> wrote:
>>>
>>>> Interesting idea! Can you explain a little more on how this would
>>>> impact durability of updates? What does a failure look like, and how does
>>>> that information get propagated back to the client app?
>>>>
>>>> Mike
>>>>
>>>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt <da...@apache.org> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> First of all, it seems that I have used the term async a lot recently :D.
>>>>> I have been thinking about changing the current indexing model of Solr
>>>>> from a synchronous one (the user submits an update request and waits for
>>>>> the response) to an asynchronous one, where nodes only persist the update
>>>>> into the tlog and then return immediately, much like what the tlog does
>>>>> now. Then we would have a dedicated executor which reads from the tlog to
>>>>> do the indexing (a producer-consumer model with the tlog acting as the
>>>>> queue).
>>>>>
>>>>> I see several big benefits to this approach:
>>>>>
>>>>>    - We can batch updates in a single call. Right now we do not use
>>>>>    Lucene's writer.addDocuments() API; batching updates would boost
>>>>>    indexing performance.
>>>>>    - A common problem with Solr today is that many threads do the
>>>>>    indexing, which can end up creating many small segments. With this
>>>>>    model we can create bigger segments, and therefore pay less merge
>>>>>    cost.
>>>>>    - Another huge reason is that after switching to this model, we can
>>>>>    remove the tlog and use a distributed queue like Kafka or Pulsar.
>>>>>    Since the purpose of the leader in SolrCloud today is ordering
>>>>>    updates, and a distributed queue already orders updates for us,
>>>>>    there would be no need for a dedicated leader. And that is just the
>>>>>    beginning of what we can do once we are on a distributed queue.
>>>>>
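The proposed model (persist, ack immediately, index from the log in batches) can be sketched with a BlockingQueue standing in for the tlog or distributed queue; none of these class names are actual Solr/Lucene code, and the drain step is where a real implementation would call Lucene's writer.addDocuments() on a whole batch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative producer-consumer sketch of the proposed async indexing model.
public class AsyncIndexer {
    final BlockingQueue<String> log = new LinkedBlockingQueue<>(); // tlog stand-in
    final List<List<String>> indexedBatches = new ArrayList<>();

    // Producer side: persist the update and acknowledge right away.
    long submit(String doc) {
        log.add(doc);
        return log.size(); // stand-in for the version/offset returned to the client
    }

    // Consumer side: one drain = one batch = one batched index write
    // (where a real implementation would call writer.addDocuments(batch)).
    int indexOneBatch() {
        List<String> batch = new ArrayList<>();
        log.drainTo(batch, 1000); // up to 1000 docs per batch
        if (!batch.isEmpty()) indexedBatches.add(batch);
        return batch.size();
    }
}
```

With a distributed queue such as Kafka in place of the local log, the offset returned by submit() would come from the queue itself, which is what makes the dedicated ordering leader unnecessary.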
>>>>> What do you guys think about this? I just want to hear from you before
>>>>> going deep into this rabbit hole.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
