[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541576 ]
Michael McCandless commented on LUCENE-1044:
--------------------------------------------

{quote}
Was that compound or non-compound index format? I imagine non-compound will take a bigger hit since each file will be synchronized separately and in a serialized fashion.
{quote}

The test was with the compound file format. But the close() on each component file that goes into the compound file also does a sync, so the compound format would take a slightly bigger hit because it has one additional sync()?

We can't safely remove the sync() on each component file before building the compound file, because we currently do a commit of the new segments file before building the compound file. I guess we could revisit whether that commit (before building the compound file) is really necessary? I think it's there from when flushing & merging were the same thing, and you do want to do this when merging to save 1X extra peak disk usage, but now that flushing is separate from merging we could remove that intermediate commit?

{quote}
I also imagine that the hit will be larger for a weaker disk subsystem, and for usage patterns that continually add a few docs and close?
{quote}

OK, I'll run the same test, but once on a laptop and once over NFS, to see what the cost is for those cases.

Yes, continually adding docs & flushing/closing your writer will in theory be most affected here. I think for such apps performance is not usually the top priority (indexing latency is)? I.e., if you wanted performance you would batch up the added docs more? Anyway, for such cases users can turn off sync() if they want to risk it?

{quote}
Is a sync before every file close really needed, or can some of them be avoided when autocommit==false?
{quote}

It's somewhat tricky to safely remove sync() even when autoCommit=false, because you don't know at close() whether the file you are closing will be referenced (and not merged away) when the commit is finally done (when IndexWriter is closed).
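For illustration, the "sync before every close" behaviour discussed above can be sketched in plain Java. This is a minimal sketch and not Lucene's code: the class `SyncOnClose`, the helper `writeAndClose`, and its `doSync` parameter are hypothetical names chosen for this example. It relies only on the standard `FileOutputStream.getFD()` and `FileDescriptor.sync()` calls.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SyncOnClose {
    /**
     * Hypothetical helper (not part of Lucene): writes data to a file and,
     * if doSync is true, asks the OS to flush it to the device before close.
     */
    public static void writeAndClose(Path path, byte[] data, boolean doSync) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path.toFile())) {
            out.write(data);
            if (doSync) {
                // Blocks until the OS reports the file's buffers as written to the device.
                out.getFD().sync();
            }
        } // the stream is closed here, after the optional sync
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".bin");
        writeAndClose(tmp, new byte[] {1, 2, 3}, true);
        System.out.println("bytes on disk: " + Files.size(tmp));
        Files.delete(tmp);
    }
}
```

Passing `doSync = false` skips the flush, trading durability across a power loss for speed; that is the trade-off the comment describes for indexes that are cheap to regenerate.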
If there were a way to sync a file after having closed it (is there?) then we could go and sync() all new files we had created that are now referenced by the segments file we are writing.

Also, I was thinking we could start simple (call sync() before every close()) and then with time, and if necessary, work out smarter ways to safely remove some of those sync()'s.

{quote}
Also, the 'sync' should be optional. Berkeley DB offers similar functionality.
{quote}

It is optional: I added a doSync boolean to FSDirectory.getDirectory(...). And I agree: for cases where there is very low cost to regenerate the index, and you want absolute best performance, you can turn off syncing.

> Behavior on hard power shutdown
> -------------------------------
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
> Reporter: venkat rangan
> Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g. pull the power cord), the index seems to get corrupted. We start a Java application as a Windows Service, and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct.
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3.
> We are not able to upgrade our customer deployments to 1.9 or later version, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.
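On the question raised in the comment, whether a file can be synced after it has already been closed: one possible approach in plain Java is to reopen the file and sync the new descriptor. The sketch below is only an illustration under stated assumptions. `SyncAfterClose`, `syncClosedFile`, and `syncAll` are hypothetical names, not Lucene APIs, and it assumes the operating system flushes data written through the earlier, now-closed handle when the reopened handle is synced, which would need to be verified per platform and file system.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class SyncAfterClose {
    /** Hypothetical helper: reopens an already-closed file and syncs it. */
    public static void syncClosedFile(Path path) throws IOException {
        // "rw" mode is used so the descriptor is writable; no bytes are modified.
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw")) {
            raf.getFD().sync();
        }
    }

    /** Hypothetical helper: syncs every file a commit is about to reference. */
    public static void syncAll(List<Path> referencedFiles) throws IOException {
        for (Path p : referencedFiles) {
            syncClosedFile(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("seg_a", ".bin");
        Path b = Files.createTempFile("seg_b", ".bin");
        // Write and close both files without syncing them.
        try (FileOutputStream out = new FileOutputStream(a.toFile())) { out.write(1); }
        try (FileOutputStream out = new FileOutputStream(b.toFile())) { out.write(2); }
        // Later, at commit time, sync only the files that are still referenced.
        syncAll(Arrays.asList(a, b));
        Files.delete(a);
        Files.delete(b);
    }
}
```

With something along these lines, the sync could be deferred until commit time and applied only to files that survive merging, rather than paid at every close().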