Not what I'm seeing. I actually see a lot of segments created and merged while it operates. Expected?
Reminding you, this is 2.9.4 / 3.0.3

On Fri, Jun 15, 2012 at 3:10 AM, Michael McCandless <[email protected]> wrote:

> Right: Lucene never autocommits anymore ...
>
> If you create a new index, add a bunch of docs, and things crash
> before you have a chance to commit, then there is no index (not even a
> 0 doc one) in that directory.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jun 14, 2012 at 1:41 PM, Itamar Syn-Hershko <[email protected]> wrote:
>
> > I'm quite certain this shouldn't happen also when Commit wasn't called.
> >
> > Mike, can you comment on that?
> >
> > On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens <[email protected]> wrote:
> >>
> >> Well, the only thing I see is that there is no place where writer.Commit()
> >> is called in the delegate assigned to corpusReader.OnDocument. I know that
> >> lucene is very transactional, and at least in 3.x, the writer will never
> >> auto commit to the index. You can write millions of documents, but if
> >> commit is never called, those documents aren't actually part of the index.
> >> Committing isn't a cheap operation, so you definitely don't want to do it
> >> on every document.
> >>
> >> You can test it yourself with this (naive) solution. Right below the
> >> writer.SetUseCompoundFile(false) line, add "int numDocsAdded = 0;". At the
> >> end of the corpusReader.OnDocument delegate add:
> >>
> >> // Example only. I wouldn't suggest committing this often
> >> if(++numDocsAdded % 5 == 0)
> >> {
> >>     writer.Commit();
> >> }
> >>
> >> I had the application crash for real on this file:
> >>
> >> http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2 ,
> >> about 20% into the operation. Without the commit, the index is empty. Add
> >> it in, and I get 755 files in the index after it crashes.
> >>
> >>
> >> Thanks,
> >> Christopher
> >>
> >> On Wed, Jun 13, 2012 at 6:13 PM, Itamar Syn-Hershko <[email protected]> wrote:
> >>
> >> > Yes, reproduced in first try. See attached program - I referenced it to
> >> > current trunk.
> >> >
> >> > On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko <[email protected]> wrote:
> >> >
> >> >> Christopher,
> >> >>
> >> >> I used the IndexBuilder app from here
> >> >> https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an
> >> >> 8.5GB wikipedia dump.
> >> >>
> >> >> After running for 2.5 days I had to forcefully close it (infinite loop in
> >> >> the wiki-markdown parser at 92%, go figure), and the 40-something GB index
> >> >> I had by then was unusable. I then was able to reproduce this.
> >> >>
> >> >> Please note I now added a few safe-guards you might want to remove to
> >> >> make sure the app really crashes on process kill.
> >> >>
> >> >> I'll try to come up with a better way to reproduce this - hopefully Mike
> >> >> will be able to suggest better ways than manual process kill...
> >> >>
> >> >> On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens <[email protected]> wrote:
> >> >>
> >> >>> Mike, The codebase for lucene.net should be almost identical to java's
> >> >>> 3.0.3 release, and LUCENE-1044 is included in that.
> >> >>>
> >> >>> Itamar, are you committing the index regularly? I only ask because I can't
> >> >>> reproduce it myself by forcibly terminating the process while it's
> >> >>> indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and
> >> >>> terminate the process (even with 10,000 4K documents created), there will
> >> >>> be no documents in the index when I open it in luke, which I expect.
> >> >>> If I commit at 10,000 documents, and terminate it a few thousand after
> >> >>> that, the index has the first ten thousand that were committed. I've even
> >> >>> terminated it *while* a second commit was taking place, and it still had
> >> >>> all of the documents I expected.
> >> >>>
> >> >>> It may be that I'm not trying to reproduce it correctly. Do you have a
> >> >>> minimal amount of code that can reproduce it?
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>> Christopher
> >> >>>
> >> >>> On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless <[email protected]> wrote:
> >> >>>
> >> >>> > Hi Itamar,
> >> >>> >
> >> >>> > One quick question: does Lucene.Net include the fixes done for
> >> >>> > LUCENE-1044 (to fsync files on commit)? Those are very important for
> >> >>> > an index to be intact after OS/JVM crash or power loss.
> >> >>> >
> >> >>> > More responses below:
> >> >>> >
> >> >>> > On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko <[email protected]> wrote:
> >> >>> >
> >> >>> > > I'm a Lucene.Net committer, and there is a chance we have a bug in our
> >> >>> > > FSDirectory implementation that causes indexes to get corrupted when
> >> >>> > > indexing is cut while the IW is still open. As it stems from some
> >> >>> > > retroactive fixes you made, I'd appreciate your feedback.
> >> >>> > >
> >> >>> > > Correct me if I'm wrong, but by design Lucene should be able to recover
> >> >>> > > rather quickly from power failures or app crashes. Since existing segment
> >> >>> > > files are read only, only new segments that are still being written can
> >> >>> > > get corrupted. Hence, recovering from worst-case scenarios is done by
> >> >>> > > simply removing the write.lock file.
> >> >>> > > The worst that could happen then is having the
> >> >>> > > last segment damaged, and that can be fixed by removing those files,
> >> >>> > > possibly by running CheckIndex on the index.
> >> >>> >
> >> >>> > You shouldn't even have to run CheckIndex ... because (as of
> >> >>> > LUCENE-1044) we now fsync all segment files before writing the new
> >> >>> > segments_N file, and then removing old segments_N files (and any
> >> >>> > segments that are no longer referenced).
> >> >>> >
> >> >>> > You do have to remove the write.lock if you aren't using
> >> >>> > NativeFSLockFactory (but this has been the default lock impl for a
> >> >>> > while now).
> >> >>> >
> >> >>> > > Last week I have been playing with rather large indexes and crashed my app
> >> >>> > > while it was indexing. I wasn't able to open the index, and Luke was even
> >> >>> > > kind enough to wipe the index folder clean even though I opened it in
> >> >>> > > read-only mode. I re-ran this, and after another crash running CheckIndex
> >> >>> > > revealed nothing - the index was detected to be an empty one. I am not
> >> >>> > > entirely sure what could be the cause for this, but I suspect it has
> >> >>> > > been corrupted by the crash.
> >> >>> >
> >> >>> > Had no commit completed (no segments file written)?
> >> >>> >
> >> >>> > If you don't fsync then all sorts of crazy things are possible...
> >> >>> >
> >> >>> > > I've been looking at these:
> >> >>> > >
> >> >>> > > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> >>> > >
> >> >>> > > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> >>> >
> >> >>> > (And LUCENE-1044 before that ...
> >> >>> > it was LUCENE-1044 that LUCENE-2328
> >> >>> > broke...).
> >> >>> >
> >> >>> > > And it seems like this is what I was experiencing. Mike and Mark will
> >> >>> > > probably be able to tell if this is what they saw or not, but as far as I
> >> >>> > > can tell this is not an expected behavior of a Lucene index.
> >> >>> >
> >> >>> > Definitely not expected behavior: assuming nothing is flipping bits,
> >> >>> > then on OS/JVM crash or power loss your index should be fine, just
> >> >>> > reverted to the last successful commit.
> >> >>> >
> >> >>> > > What I'm looking for at the moment is some advice on what FSDirectory
> >> >>> > > implementation to use to make sure no corruption can happen. The 3.4
> >> >>> > > version (which is where LUCENE-3418 was committed to) seems to handle a
> >> >>> > > lot of things the 3.0 doesn't, but on the other hand LUCENE-3418 was
> >> >>> > > introduced by changes made to the 3.0 codebase.
> >> >>> >
> >> >>> > Hopefully it's just that you are missing fsync!
> >> >>> >
> >> >>> > > Also, is there any test in the suite checking for those scenarios?
> >> >>> >
> >> >>> > Our test framework has a sneaky MockDirectoryWrapper that, after a
> >> >>> > test finishes, goes and corrupts any unsync'd files and then verifies
> >> >>> > the index is still OK... it's good because it'll catch any times we
> >> >>> > are missing calls to sync, but, it's not low level enough such that if
> >> >>> > FSDir is failing to actually call fsync (that was the bug in
> >> >>> > LUCENE-3418) then it won't catch that...
> >> >>> >
> >> >>> > Mike McCandless
> >> >>> >
> >> >>> > http://blog.mikemccandless.com
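
The commit ordering described in the thread (sync the segment files first, then write the new segments_N file) can be illustrated with a toy model. This is only a sketch and not Lucene or Lucene.Net code: the class `ToyCommit`, the `commit_N` file name and the `.seg` files are hypothetical stand-ins, and deleting old commit points and unreferenced segments is left out. It assumes that a reader only trusts segments named by the newest commit file, which is why segment files can appear on disk during indexing while the index still looks empty.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a commit-point protocol. Hypothetical; not the Lucene implementation. */
final class ToyCommit {

    /** Flush: a segment file lands on disk, but no commit point references it yet. */
    static void flushSegment(Path dir, String name, String docs) throws IOException {
        Files.write(dir.resolve(name), docs.getBytes(StandardCharsets.UTF_8));
    }

    /** Commit: force every segment to stable storage, then publish the commit file. */
    static void commit(Path dir, long generation, List<String> segments) throws IOException {
        for (String segment : segments) {
            try (FileChannel ch = FileChannel.open(dir.resolve(segment), StandardOpenOption.WRITE)) {
                ch.force(true); // the fsync step: segment data must be durable first
            }
        }
        // Only after all segments are synced is the commit file written and synced.
        Path commitFile = dir.resolve("commit_" + generation);
        try (FileChannel ch = FileChannel.open(commitFile, StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(String.join("\n", segments).getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }
    }

    /** What a reader sees: the segments listed by the newest commit file, or nothing. */
    static List<String> visibleSegments(Path dir) throws IOException {
        long newest = -1;
        try (DirectoryStream<Path> commits = Files.newDirectoryStream(dir, "commit_*")) {
            for (Path p : commits) {
                String name = p.getFileName().toString();
                newest = Math.max(newest, Long.parseLong(name.substring("commit_".length())));
            }
        }
        if (newest < 0) {
            return new ArrayList<>(); // no commit ever completed: the index is empty
        }
        return Files.readAllLines(dir.resolve("commit_" + newest), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        flushSegment(dir, "_0.seg", "doc1 doc2");
        System.out.println(visibleSegments(dir)); // prints [] (segment exists, no commit)
        commit(dir, 1, Arrays.asList("_0.seg"));
        flushSegment(dir, "_1.seg", "doc3");      // a crash here loses only _1.seg
        System.out.println(visibleSegments(dir)); // prints [_0.seg]
    }
}
```

In this model, killing the process at any point leaves either no commit file or the last fully written one, which matches the "reverted to the last successful commit" behaviour described above.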
