[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142197#comment-17142197 ]
Simon Willnauer commented on LUCENE-8962:
-----------------------------------------

I attached a log file with a failure and index writer logging enabled. I think I have a smoking gun here:

{noformat}
[junit4] 1> IW 2 [2020-06-22T15:33:08.479370Z; Lucene Merge Thread #31]: merged segment _2f(9.0.0):c1:[diagnostics={os.version=10.14.6, mergeMaxNumSegments=-1, java.version=12.0.1, java.vm.version=12.0.1+12, lucene.version=9.0.0, timestamp=1592839988473, os=Mac OS X, java.runtime.version=12.0.1+12, mergeFactor=1, os.arch=x86_64, source=merge, java.vendor=Oracle Corporation}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}] :id=70ut1o7n9kgx798rria9sma3n is 100% deleted; skipping insert
[junit4] 1> SM 2 [2020-06-22T15:33:08.479615Z; Lucene Merge Thread #32]: 0 msec to write field infos [29 docs]
[junit4] 1> IFD 2 [2020-06-22T15:33:08.479662Z; Lucene Merge Thread #31]: will delete new file "_2f.cfe"
[junit4] 1> IFD 2 [2020-06-22T15:33:08.479712Z; Lucene Merge Thread #31]: will delete new file "_2f.cfs"
[junit4] 1> IFD 2 [2020-06-22T15:33:08.479740Z; Lucene Merge Thread #31]: will delete new file "_2f.si"
[junit4] 1> IFD 2 [2020-06-22T15:33:08.479770Z; Lucene Merge Thread #31]: delete [_2f.cfe, _2f.cfs, _2f.si]
{noformat}

and further down:

{noformat}
[junit4] 2> Caused by: java.nio.file.NoSuchFileException: _2f.cfs
[junit4] 2> 	at org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148)
[junit4] 2> 	at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607)
[junit4] 2> 	at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
[junit4] 2> 	at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696)
[junit4] 2> 	at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690)
[junit4] 2> 	at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
[junit4] 2> 	at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:382)
[junit4] 2> 	at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:527)
[junit4] 2> 	at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:3546)
[junit4] 2> 	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3502)
[junit4] 2> 	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3452)
[junit4] 2> 	at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879)
[junit4] 2> 	... 1 more
{noformat}

I think we are not respecting the fact that the segment is fully deleted, and it's dropped because of this before we get a chance to incRef it. I will work on a patch for this.

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>          Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory segments to disk and open an {{IndexReader}} to search them, and this is typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} will accumulate many small segments during {{refresh}}, and this then adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader?
> It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
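The deletion race Simon describes above can be illustrated with a toy reference-counting model. This is a stdlib-only sketch, not Lucene's actual IndexFileDeleter; the names `ToyFileDeleter`, `register`, and `isLive` are hypothetical. The point it shows: once the merge thread drops the files of a 100% deleted segment, a later incRef from the committing thread has nothing left to reference.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of refcounted index files; all names here are hypothetical
// illustrations, not Lucene code.
class ToyFileDeleter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // A newly written merge output file starts with one reference.
    void register(String file) {
        refCounts.put(file, 1);
    }

    // The committing thread incRefs files it wants to keep. If the merge
    // thread already dropped them (the segment was 100% deleted), the file
    // is gone -- this models the NoSuchFileException in the log above.
    void incRef(String file) {
        Integer rc = refCounts.get(file);
        if (rc == null) {
            throw new IllegalStateException(file + " was already deleted");
        }
        refCounts.put(file, rc + 1);
    }

    // When the count reaches zero the file is deleted
    // ("will delete new file ..." in the IFD log lines).
    void decRef(String file) {
        int rc = refCounts.get(file) - 1;
        if (rc == 0) {
            refCounts.remove(file);
        } else {
            refCounts.put(file, rc);
        }
    }

    // The safe pattern: check liveness before attempting to incRef.
    boolean isLive(String file) {
        return refCounts.containsKey(file);
    }
}
```

In this model, the fix direction hinted at in the comment would be for the committing side to verify the segment is still live before incRef'ing it, instead of assuming its files still exist.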
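The merge-on-refresh idea in the issue description has a selection step that might be sketched like this. This is a stdlib-only illustration under assumed names (`RefreshMergeSelector`, `flushGen`), not Lucene's real merge policy interface: pick only segments that are both below a size threshold and were flushed at or before the refresh point, so segments flushed while the merge is running cannot leak into the point-in-time view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the merge-on-refresh selection step; NOT
// Lucene's MergePolicy API.
class RefreshMergeSelector {
    static final class Segment {
        final String name;
        final long sizeBytes;
        final long flushGen; // generation at which this segment was flushed
        Segment(String name, long sizeBytes, long flushGen) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.flushGen = flushGen;
        }
    }

    // Candidates: flushed at or before the refresh point, and below the
    // size threshold. Segments flushed while we wait for the merge must
    // not be included in the point-in-time segments returned by refresh.
    static List<Segment> selectSmallSegments(
            List<Segment> all, long refreshGen, long maxSizeBytes) {
        List<Segment> candidates = new ArrayList<>();
        for (Segment s : all) {
            if (s.flushGen <= refreshGen && s.sizeBytes < maxSizeBytes) {
                candidates.add(s);
            }
        }
        // Merging a single small segment buys nothing; require at least two.
        return candidates.size() > 1 ? candidates : Collections.emptyList();
    }
}
```

The hard part the description calls out is not this selection but what follows it: opening the near-real-time reader over the merged result while still excluding segments flushed after the refresh point.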