[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837740#action_12837740 ]
Michael McCandless commented on LUCENE-2283:
--------------------------------------------

TermVectorsTermsWriter has the same issue.

You're right: with "irregular" sized documents coming through, you can end up with PerDoc instances that waste space, because the RAMFile has buffers allocated from past huge docs that the latest tiny docs don't use.

Note that the number of outstanding PerDoc instances is a function of how "out of order" the docs are being indexed, because the PerDoc holds any state only until that doc can be written to the store files (stored fields, term vectors). It's transient. E.g., with a single thread, there will only be one PerDoc -- it's written immediately. With 2 threads, if you have a massive doc (which thread 1 gets stuck indexing) and then zillions of tiny docs (which thread 2 burns through while thread 1 is busy), then you can get a large number of PerDocs created, waiting their turn because thread 1 hasn't finished yet.

But this process won't use unbounded RAM -- the RAM used by the RAMFiles is accounted for, and once it gets too high (10% of the RAM buffer size), we forcefully idle the incoming threads until the "out of orderness" is resolved. E.g., in this case, thread 2 will stall until thread 1 has finished its doc. That byte accounting does include the allocated-but-not-used byte[1024] buffers inside RAMFile (we use RAMFile.sizeInBytes()).

So... this is not really a memory leak. But it is a potential starvation issue: if your PerDoc instances all grow large RAMFiles over time (as each has had to service a very large document), then the amount of concurrency that DW allows can become "pinched", especially if these docs are large relative to your RAM buffer size.

Are you hitting this issue? I.e., seeing poor concurrency during indexing despite using many threads, because DW is forcefully idling the threads?
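The retention described above can be sketched as follows. This is a simplified illustration, not the actual Lucene RAMFile class; the write/reset/sizeInBytes methods here are stand-ins for the real API, kept only to show why allocated buffers outlive the large doc that caused them:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (NOT the actual Lucene RAMFile). reset() only rewinds
// the logical length, so the byte[1024] blocks allocated for a past huge
// doc stay allocated -- and are still counted by sizeInBytes() -- when
// later tiny docs reuse the same pooled instance.
class SketchRAMFile {
    static final int BUFFER_SIZE = 1024;
    private final List<byte[]> buffers = new ArrayList<>();
    private long length;

    void write(int numBytes) {
        long needed = length + numBytes;
        while ((long) buffers.size() * BUFFER_SIZE < needed) {
            buffers.add(new byte[BUFFER_SIZE]); // grows, never shrinks
        }
        length = needed;                        // byte copying omitted
    }

    void reset() {
        length = 0;                             // buffers stay allocated
    }

    long sizeInBytes() {
        // Counts allocated (not just used) buffers, matching the
        // accounting the comment describes for RAMFile.sizeInBytes().
        return (long) buffers.size() * BUFFER_SIZE;
    }
}
```

After one ~1 MB document, a reset instance handed a 100-byte document still pins roughly 1 MB of accounted RAM -- which is why a pool full of such instances eats into the 10% budget and stalls threads.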
It should only happen if you sometimes index docs that are larger than RAMBufferSize/10/numberOfIndexingThreads.

I'll work out a fix. I think we should fix RAMFile.reset to trim its buffers using ArrayUtil.getShrinkSize.

> Possible Memory Leak in StoredFieldsWriter
> ------------------------------------------
>
>                 Key: LUCENE-2283
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2283
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Tim Smith
>            Assignee: Michael McCandless
>
> StoredFieldsWriter creates a pool of PerDoc instances.
> This pool will grow but never be reclaimed by any mechanism.
> Furthermore, each PerDoc instance contains a RAMFile. This RAMFile will also never be truncated, and will only ever grow (as far as I can tell).
> When feeding documents with a large number of stored fields (or one large dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare.
> Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of RAMFiles that are cached), etc.
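The proposed trim-on-reset could look roughly like the sketch below. The "less than half in use" threshold is an assumption standing in for the ArrayUtil.getShrinkSize heuristic (whose exact policy isn't quoted here); the actual patch may pick a different rule:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed fix: reset() drops excess buffers when far fewer
// were used than are allocated. The 50% threshold is an assumption that
// approximates ArrayUtil.getShrinkSize, not the committed patch.
class TrimmingBufferPool {
    static final int BUFFER_SIZE = 1024;
    private final List<byte[]> buffers = new ArrayList<>();
    private long length;

    void write(int numBytes) {
        long needed = length + numBytes;
        while ((long) buffers.size() * BUFFER_SIZE < needed) {
            buffers.add(new byte[BUFFER_SIZE]);
        }
        length = needed;                        // byte copying omitted
    }

    void reset() {
        int used = (int) ((length + BUFFER_SIZE - 1) / BUFFER_SIZE);
        // Shrink only when under half the allocated buffers were used,
        // to avoid grow/shrink thrashing on steady-sized documents.
        if (used < buffers.size() / 2) {
            buffers.subList(used, buffers.size()).clear();
        }
        length = 0;
    }

    long sizeInBytes() {
        return (long) buffers.size() * BUFFER_SIZE;
    }
}
```

With this, a pooled instance that served one huge doc gives its buffers back the next time it is reset after a tiny doc, so the accounted RAM (and hence the stall threshold) recovers instead of staying "pinched".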