[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846960#action_12846960 ]
Michael McCandless commented on LUCENE-2328:
--------------------------------------------

{quote}
bq. Keeping track of not-yet-sync'd files instead of sync'd files is better, but it still requires upkeep (ie when file is deleted you have to remove it) because files can be opened, written to, closed, deleted without ever being sync'd.

You can just skip this and handle FileNotFound exception when syncing. Have to handle it anyway, no guarantees some file won't be snatched from under your nose.
{quote}

IW & IR do in fact guarantee they will never ask for a deleted file to be sync'd. If they ever do that we have more serious problems ;)

{quote}
bq. This will over-sync in some situations.

Don't feel this is a serious problem. If you over-sync (in fact sync some files a little bit earlier than strictly required), in a few seconds you will under-sync, so total time is still the same.
{quote}

I think this is important -- commit is already slow enough -- why make it slower? Further, the extra files you sync'd may never have needed to be sync'd (they will be merged away). My examples above include such cases.

Turning this around... what's so bad about keeping the sync per file?

bq. System-wide sync is not the original aim, it's just a possible byproduct of what is the original aim

I know this is not the aim of this issue, rather just a nice by-product if we switch to a "global sync" method.

bq. to move sync tracking code from IW to Directory.

Right, this is a great step forward, as long as we don't slow commit by dumbing down the API :)

bq. And I don't see at all how adding batch-syncs achieves this.

You're right: this doesn't achieve / is not required for "moving sync'd file tracking" down to Dir. It's orthogonal, but it is another way that we could allow Dir impls to do global sync. I'm proposing this as a different change, to make the API better match the needs of its consumers.
In fact, really the OS ought to allow for this as well (but I know of none that do) since it'd give the IO scheduler more freedom on which bytes need to be moved to disk.

We can open this one as a separate issue...

> IndexWriter.synced field accumulates data leading to a Memory Leak
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2328
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2328
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
>         Environment: all
>            Reporter: Gregor Kaczor
>            Priority: Minor
>             Fix For: 3.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I am running into a strange OutOfMemoryError. My small test application does
> index and delete some few files. This is repeated for 60k times. Optimization
> is run from every 2k times a file is indexed. Index size is 50KB. I did analyze
> the HeapDumpFile and realized that IndexWriter.synced field occupied more than
> half of the heap. That field is a private HashSet without a getter. Its task is
> to hold files which have been synced already.
> There are two calls to addAll and one call to add on synced but no remove or
> clear throughout the lifecycle of the IndexWriter instance.
> According to the Eclipse Memory Analyzer synced contains 32618 entries which
> look like file names "_e065_1.del" or "_e067.cfs"
> The index directory contains 10 files only.
> I guess synced is holding obsolete data
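
For readers following the thread, here is a minimal sketch of the "track not-yet-sync'd files" idea discussed in the comment, including the upkeep on delete that keeps the set from growing the way the reported synced set does. This is not Lucene code: the class UnsyncedFileTracker and all of its method names are hypothetical, and the actual fsync call is left to the caller.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not part of Lucene): remembers files that were
 * written but not yet synced, instead of every file ever synced.
 */
final class UnsyncedFileTracker {
    private final Set<String> unsynced = new HashSet<>();

    /** A file was written and closed; it needs a sync before it can be committed. */
    synchronized void fileWritten(String name) {
        unsynced.add(name);
    }

    /** Upkeep on delete: drop the entry so stale names cannot accumulate. */
    synchronized void fileDeleted(String name) {
        unsynced.remove(name);
    }

    /** Of the files a commit references, return only those still needing a sync. */
    synchronized Set<String> pendingAmong(Collection<String> names) {
        Set<String> pending = new HashSet<>(names);
        pending.retainAll(unsynced);
        return pending;
    }

    /** Call only after the caller has successfully synced these files. */
    synchronized void markSynced(Collection<String> names) {
        unsynced.removeAll(names);
    }

    /** Number of files currently awaiting a sync. */
    synchronized int pendingCount() {
        return unsynced.size();
    }
}
```

Because pendingAmong takes the whole batch of names for a commit, a caller could sync the files one by one or, assuming the underlying store supports it, with a single global sync. Files that were written and then merged away before any commit are never synced at all.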