[GitHub] [lucenenet] rclabo commented on issue #763: Support async task

GitBox Fri, 18 Nov 2022 05:03:19 -0800


rclabo commented on issue #763:
URL: https://github.com/apache/lucenenet/issues/763#issuecomment-1319966984

> After Merge() operation, we immediately collect the output files and store
them to cloud storage. Seems the Lucene Index is valid already and we can run
query against it successfully. So, there should be no required post write
operations.

I'm not sure I fully understand. When you call `IndexWriter.Commit`, that
thread writes a new index segment to storage, handled at a top level by
`DocumentsWriterPerThread` (which I had forgotten was introduced in Lucene 4).
That index segment on disk will be fairly small, given that it had to fit in
RAM in its entirety before being written to disk.

Then, in the default configuration, the
[ConcurrentMergeScheduler](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/core/Lucene.Net.Index.ConcurrentMergeScheduler.html)
will run on background threads and merge that tiny segment with other segments
to create a new larger segment. This process repeats itself over and over as
new documents are written to the index and committed.

[This video](https://www.youtube.com/watch?v=YW0bOvLp72E) shows the
segment writing and merging that happens as Java Lucene indexes Wikipedia. It
creates a nice visual of the process we are talking about. You can see all the
small initial segments being written as various commits are called and then see
the segments getting combined into larger segments by the background workers.
And eventually those larger segments get combined to create even larger
segments and so on.

So when you say "After Merge() operation, we immediately collect the output
files and store them to cloud storage", that would only work well if there is
no need to add additional documents to the index. Otherwise, without ongoing
merging, the number of segments will grow very large and search times will
become slower and slower, because every segment must be searched.
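
To make the segment growth concrete, here is a toy simulation in Python. It is
not Lucene code and does not reproduce Lucene's actual merge policy; the
function `simulate_commits` and its `merge_factor` parameter are hypothetical
names I made up for illustration. It assumes every commit flushes exactly one
small segment and that a merge happens whenever `merge_factor` segments of the
same size exist.

```python
def simulate_commits(num_commits, docs_per_commit, merge_factor=None):
    """Return segment sizes (in documents) after num_commits commits.

    merge_factor=None disables merging entirely. Otherwise, whenever
    merge_factor segments of equal size exist, they are combined into one
    larger segment. This is a simplified stand-in for a real merge policy.
    """
    segments = []
    for _ in range(num_commits):
        # Assumption: each commit flushes one small segment.
        segments.append(docs_per_commit)
        if merge_factor is None:
            continue
        # Keep merging until no size has merge_factor segments left,
        # so that merges can cascade into ever larger segments.
        merged = True
        while merged:
            merged = False
            for size in set(segments):
                if segments.count(size) >= merge_factor:
                    for _ in range(merge_factor):
                        segments.remove(size)
                    segments.append(size * merge_factor)
                    merged = True
                    break
    return sorted(segments, reverse=True)


without_merging = simulate_commits(1234, 100)
with_merging = simulate_commits(1234, 100, merge_factor=10)

print(len(without_merging), "segments without merging")
print(len(with_merging), "segments with merging:", with_merging)
```

With these made-up numbers, the index without merging ends up with one segment
per commit, while the merged index holds the same documents in a handful of
segments of increasing size. A search has to visit every segment, which is why
the first case gets slower as commits accumulate.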

I hope this information is helpful.

