[ 
https://issues.apache.org/jira/browse/SOLR-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913867#action_12913867
 ] 

Jan Høydahl commented on SOLR-1924:
-----------------------------------

I'm very much in favour of a robust way of knowing whether your docs are safely persisted.

Building a persistent input document queue introduces too much overhead and latency for NRT (near-real-time) cases, and should be avoided.

One way to go could be an ACK-through-polling protocol:
1. The client performs an ADD (or other operation) with new optional parameters clientId="myClient", batchId="fooBar0001"
2. Solr adds the batch (a group of docs) in memory and adds the batchId to an AckPendingQueue
3. The client performs further requests, with or without explicit commits, each with a unique batchId
4. On every successful COMMIT, i.e. once the segment is secured on disk, Solr moves all associated batchIds to an AckPersistedQueue
5. The client polls as often as it likes with a new operation <STATUS clientId="myClient">
6. Solr's response is the list of pending and persisted batchIds for that clientId, e.g. <persisted count="2">fooBar0001 fooBar0002</persisted> <pending count="1">fooBar0003</pending>
7. The client can now update its state accordingly and knows for sure that it does not need to resubmit the persisted batches
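As a sketch of the client side of steps 5-7 (everything here is hypothetical: neither the clientId/batchId parameters nor a STATUS operation exist in Solr today, and AckTracker is an invented name), the state update on each poll might look like:

```python
class AckTracker:
    """Client-side view of which submitted batches Solr has acknowledged.
    Illustrative only; not a real Solr or SolrJ class."""

    def __init__(self):
        self.outstanding = set()  # batchIds submitted but not yet confirmed persisted

    def submitted(self, batch_id):
        self.outstanding.add(batch_id)

    def on_status(self, persisted, pending):
        # Batches listed in <persisted> are durable on disk; stop tracking them.
        confirmed = self.outstanding & set(persisted)
        self.outstanding -= confirmed
        # Anything we submitted that Solr no longer lists in either queue
        # was lost (e.g. across a restart) and must be resubmitted.
        lost = self.outstanding - set(pending)
        return confirmed, lost


tracker = AckTracker()
for b in ("fooBar0001", "fooBar0002", "fooBar0003"):
    tracker.submitted(b)
# STATUS response from step 6: two persisted, one still pending
confirmed, lost = tracker.on_status(
    persisted=["fooBar0001", "fooBar0002"], pending=["fooBar0003"])
```

With this bookkeeping the client only ever resubmits batches in `lost`, never the confirmed ones.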

In case of a Solr restart or some error, all batchIds would disappear from both the pending and persisted queues; the client detects the loss when parsing the next STATUS response and resubmits the lost batches.
The overhead of this approach is maintaining the in-memory status queues on the indexers. Entries should expire after some time, or once a queue exceeds a size limit.
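The server-side bookkeeping this implies (steps 2, 4 and 6, plus size-based expiry) could be sketched roughly as follows; the class and method names are made up for illustration, and a real implementation would also need time-based expiry and per-clientId separation:

```python
from collections import OrderedDict


class AckQueues:
    """Hypothetical per-client ack bookkeeping on the indexer.
    Entries here expire by queue size only, for brevity."""

    def __init__(self, max_persisted=1000):
        self.pending = []               # batchIds added but not yet committed
        self.persisted = OrderedDict()  # insertion-ordered, oldest first
        self.max_persisted = max_persisted

    def add_batch(self, batch_id):
        # step 2: record the batch on the AckPendingQueue
        self.pending.append(batch_id)

    def on_commit(self):
        # step 4: a successful commit secures all pending batches on disk,
        # so move them to the AckPersistedQueue, expiring the oldest entries
        for b in self.pending:
            self.persisted[b] = True
            if len(self.persisted) > self.max_persisted:
                self.persisted.popitem(last=False)
        self.pending = []

    def status(self):
        # step 6: report both queues to the polling client
        return list(self.persisted), list(self.pending)
```

On restart both structures are simply gone, which is exactly what lets the client detect loss from the next STATUS response.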

On top of this, in SolrJ we could even implement auto polling, automatic batch resubmission, feed-speed throttling when #pending > someMaxValue, and true callbacks to client code such as batchPersisted(), batchLost(), serverDown(), serverUp()... This would offload much complexity from the shoulders of the API consumer.
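The callback-plus-throttling idea could look something like the sketch below (the callback names mirror this comment, but none of this exists in SolrJ; the threshold and class name are invented):

```python
MAX_PENDING = 2  # someMaxValue: throttle threshold, illustrative only


class AutoAckClient:
    """Hypothetical SolrJ-style convenience layer: invokes user callbacks on
    each poll result and pauses feeding when too many batches are unacked."""

    def __init__(self, on_persisted, on_lost):
        self.on_persisted = on_persisted  # batchPersisted() callback
        self.on_lost = on_lost            # batchLost() callback
        self.outstanding = set()

    def can_send(self):
        # feed-speed throttling: stop feeding while #pending > MAX_PENDING
        return len(self.outstanding) <= MAX_PENDING

    def send(self, batch_id):
        self.outstanding.add(batch_id)

    def poll(self, persisted, pending):
        # apply one STATUS response, firing the appropriate callbacks
        for b in self.outstanding & set(persisted):
            self.outstanding.discard(b)
            self.on_persisted(b)
        for b in set(self.outstanding) - set(pending):
            self.outstanding.discard(b)
            self.on_lost(b)  # caller would typically resubmit here
```

The client code then only reacts to batchPersisted/batchLost events instead of tracking ack state itself.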

Do you see any glitches in this general approach?

> Solr's updateRequestHandler does not have a fast way of guaranteeing document 
> delivery
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-1924
>                 URL: https://issues.apache.org/jira/browse/SOLR-1924
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Karl Wright
>
> It is currently not possible, without performing a commit on every document, 
> to use updateRequestHandler to guarantee delivery into the index of any 
> document.  The reason is that whenever Solr is restarted, some or all 
> documents that have not been committed yet are dropped on the floor, and 
> there is no way for a client of updateRequestHandler to know which ones this 
> happened to.
> I believe it is not even possible to write a middleware-style layer that 
> stores documents and performs periodic commits on its own, because the update 
> request handler never ACKs individual documents on a commit, but merely ACKs 
> everything it has seen since the last time Solr bounced.  So you have this 
> potential scenario:
> - middleware layer receives document 1, saves it
> - middleware layer receives document 2, saves it
> Now it's time for the commit, so:
> - middleware layer sends document 1 to updateRequestHandler
> - solr is restarted, dropping all uncommitted documents on the floor
> - middleware layer sends document 2 to updateRequestHandler
> - middleware layer sends COMMIT to updateRequestHandler, but solr adds only 
> document 2 to the index
> - middleware believes incorrectly that it has successfully committed both 
> documents
> An ideal solution would be for Solr to separate the semantics of commit (the 
> index building variety) from the semantics of commit (the 'I got the 
> document' variety).  Perhaps this will involve a persistent document queue 
> that will persist over a Solr restart.
> An alternative mechanism might be for updateRequestHandler to acknowledge 
> specifically committed documents in its response to an explicit commit.  But 
> this would make it difficult or impossible to use autocommit usefully in such 
> situations.  The only other alternative is to require clients that need 
> guaranteed delivery to commit on every document, with a considerable 
> performance penalty.
> This ticket is related to LCF in that LCF is one of the clients that really 
> needs some kind of guaranteed delivery mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

