[jira] [Commented] (SOLR-14923) Indexing performance is unacceptable when child documents are involved

David Smiley (Jira) Wed, 04 Nov 2020 14:42:19 -0800


    [ https://issues.apache.org/jira/browse/SOLR-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226412#comment-17226412 ]


David Smiley commented on SOLR-14923:
-------------------------------------

I've had trouble prioritizing this because it requires many hours to 
investigate through code I don't like.  I'll try to give you some answers 
without (yet) really digging in:

bq.  Would it be sufficient to track the document ids which require a reload 
and clear them on each openRealTimeSearcher call?

Where would the ID tracking you refer to _go_ (whose responsibility is it)?  I 
don't think UpdateLog.  
org.apache.solr.update.processor.DistributedUpdateProcessor is doing a lot 
already.  Thinking back to my suggestion on SOLR-12638, I think I was 
referring to RTGComponent because Mosh said that it was the component 
involved in this use-case.  And I was not imagining tracking an ever-growing 
list of IDs somewhere; I think just some sort of dirty flag on RTGComponent.  
See the variable "mustUseRealtimeSearcher" there -- maybe we could make it get 
and clear some AtomicReference<Boolean> or something.  It's worth a shot but it 
feels inelegant... I lack the deeper understanding as to why 
UpdateLog.openRealtimeSearcher must be called at all.  Mosh at the time said 
"RTGComponent is not aware of the newly indexed yet not committed child docs."  
This is foggy to me; I don't know why RTGComponent should be aware at all, and 
I don't recall how RTGComponent is involved in the whole thing.  Maybe between 
you and me, we shall figure this out :-)
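
To make the dirty-flag idea concrete, here is a minimal sketch.  It is not 
Solr code: the class and method names (RealtimeSearcherDirtyFlag, markDirty, 
reopenIfDirty) are hypothetical, and where such a flag would actually live is 
exactly the open question above.  It uses AtomicBoolean.getAndSet rather than 
an AtomicReference<Boolean>, which gives the same "get and clear" in one 
atomic step.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of Solr: a single "dirty" bit instead of an
// ever-growing list of document IDs.
final class RealtimeSearcherDirtyFlag {
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    /** Assumed to be called by the update path after adding a doc with children. */
    void markDirty() {
        dirty.set(true);
    }

    /**
     * Atomically reads and clears the flag, then runs the (expensive) reopen
     * action only if the flag was set. Returns true if the action ran.
     */
    boolean reopenIfDirty(Runnable reopen) {
        if (dirty.getAndSet(false)) {
            reopen.run();
            return true;
        }
        return false;
    }
}
```

Clearing the flag before running the reopen action means that an update 
arriving while the reopen is in progress sets the flag again, so the worst 
case is one extra reopen on the next read rather than a missed one.  Any 
number of updates between two reads costs a single reopen.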

[~markrmil...@gmail.com]: AFAICT you originally added 
{{UpdateLog.openRealtimeSearcher}}.  Why is it located _there_ instead of, say, 
UpdateHandler? I'm honestly confused that UpdateLog refers to the index 
altogether; it should be independent according to my conceptual understanding.  
When there isn't an updateLog (it's technically optional), there may be a 
bug, because the reader probably still needs to be re-opened.

bq. What should be the result of two concurrent updates on the same document?  
I think it is the same as with normal atomic updates, and due to the fact that 
there is no rollback on transactions this can only be detected by versioning.

Yes; that's logical to me.
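
For illustration, detection by versioning boils down to a compare-and-set on 
a per-document version.  The sketch below is plain Java with hypothetical 
names (VersionedStore, tryUpdate) and a simple counter as the version; it is 
not how Solr stores or generates its _version_ values, it only shows why the 
second of two concurrent writers is rejected instead of silently overwriting 
the first.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory model of version-checked updates; not Solr code.
final class VersionedStore {
    private final ConcurrentMap<String, Long> versions = new ConcurrentHashMap<>();

    /** Returns the current version of a document, or 0 if it does not exist. */
    long currentVersion(String id) {
        return versions.getOrDefault(id, 0L);
    }

    /**
     * Applies an update only if the caller saw the latest version.
     * Returns false when another writer got there first (a conflict).
     */
    boolean tryUpdate(String id, long expectedVersion) {
        long next = expectedVersion + 1;
        if (expectedVersion == 0L) {
            // The document must not exist yet.
            return versions.putIfAbsent(id, next) == null;
        }
        // Atomic compare-and-set on the stored version.
        return versions.replace(id, expectedVersion, next);
    }
}
```

Two writers that both read version 0 and then call tryUpdate("doc", 0) get 
one true and one false; the loser has to re-read the document and retry, 
since there is nothing to roll back.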

> Indexing performance is unacceptable when child documents are involved
> ----------------------------------------------------------------------
>
>                 Key: SOLR-14923
>                 URL: https://issues.apache.org/jira/browse/SOLR-14923
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update, UpdateRequestProcessors
>    Affects Versions: master (9.0), 8.3, 8.4, 8.5, 8.6
>            Reporter: Thomas Wöckinger
>            Priority: Critical
>              Labels: performance
>
> Parallel indexing does not make sense at the moment when child documents are used.
> The org.apache.solr.update.processor.DistributedUpdateProcessor checks at the 
> end of the method doVersionAdd whether the Ulog caches should be refreshed.
> This check returns true if any child document is included in the 
> AddUpdateCommand.
> If so, ulog.openRealtimeSearcher() is called. This call is very expensive 
> and is executed in a synchronized block on the UpdateLog instance, so all 
> other operations on the UpdateLog are blocked too.
> Because every important UpdateLog method (add, delete, ...) runs in a 
> synchronized block, almost every operation is blocked.
> This reduces multi-threaded index updates to single-threaded behavior.
> The described behavior does not depend on any option of the UpdateRequest, 
> so it makes no difference whether 'waitFlush', 'waitSearcher' or 
> 'softCommit' is true or false.
> The described behavior makes ChildDocuments unusable in practice, because the 
> performance is unacceptable.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
