[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837740#action_12837740 ]

Michael McCandless commented on LUCENE-2283:
--------------------------------------------


TermVectorsTermsWriter has the same issue.

You're right: with irregularly sized documents coming through, you can
end up with PerDoc instances that waste space, because the RAMFile
holds buffers allocated for past huge docs that the latest tiny docs
don't use.

Note that the number of outstanding PerDoc instances is a function of
how "out of order" the docs are being indexed: a PerDoc holds its
doc's state only until that doc can be written to the store files
(stored fields, term vectors).  It's transient.

E.g. with a single thread, there will only ever be one PerDoc -- it's
written immediately.  With 2 threads, if you have a massive doc (which
thread 1 gets stuck indexing) and then zillions of tiny docs (which
thread 2 burns through while thread 1 is busy), then you can get a
large number of PerDocs created, waiting their turn because thread 1
hasn't finished yet.

But this process won't use unbounded RAM -- the RAM used by the
RAMFiles is accounted for, and once it gets too high (10% of the RAM
buffer size), we forcefully idle the incoming threads until the "out
of orderness" is resolved.  E.g. in this case, thread 2 will stall
until thread 1 has finished its doc.  That byte accounting does cover
the allocated-but-unused byte[1024] buffers inside RAMFile (we use
RAMFile.sizeInBytes()).
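The accounting point can be illustrated with a tiny model (a hedged simplification, not Lucene's actual class -- the 1 KB buffer granularity mirrors the byte[1024] mentioned above, everything else is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for RAMFile: a growing list of fixed-size buffers.
class RAMFileModel {
    static final int BUFFER_SIZE = 1024;
    private final List<byte[]> buffers = new ArrayList<>();
    private long length;  // bytes actually written since the last reset

    void writeBytes(int numBytes) {
        length += numBytes;
        // Allocate 1 KB buffers until the allocation covers the length.
        while (buffers.size() * BUFFER_SIZE < length) {
            buffers.add(new byte[BUFFER_SIZE]);
        }
    }

    void reset() {
        // Mirrors the behavior under discussion: buffers are kept for
        // reuse, so a past huge doc continues to pin them.
        length = 0;
    }

    long sizeInBytes() {
        // The accounting is based on allocated buffers, not on 'length',
        // so allocated-but-unused space still counts toward the 10% cap.
        return (long) buffers.size() * BUFFER_SIZE;
    }
}
```

After a huge doc passes through, reset() keeps the buffers, so sizeInBytes() -- and hence the RAM accounting -- stays at the high-water mark even while the file holds only a tiny doc.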

So... this is not really a memory leak.  But it is a potential
starvation issue: if your PerDoc instances all grow large RAMFiles
over time (because each has had to service a very large document),
then the amount of concurrency that DocumentsWriter (DW) allows can
become "pinched".  Especially if these docs are large relative to your
RAM buffer size.

Are you hitting this issue?  I.e. seeing poor concurrency during
indexing despite using many threads, because DW is forcefully idling
the threads?  It should only happen if you sometimes index docs
that are larger than RAMBufferSize/10/numberOfIndexingThreads.
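A back-of-the-envelope check for that condition (the 10% figure comes from the comment above; the buffer size and thread count below are made-up example numbers):

```java
// Hedged sketch: largest per-doc RAM footprint before DW starts
// stalling threads, per RAMBufferSize/10/numberOfIndexingThreads.
class StallThreshold {
    static long thresholdBytes(long ramBufferBytes, int numIndexingThreads) {
        return ramBufferBytes / 10 / numIndexingThreads;
    }
}
```

With a 64 MB RAM buffer and 4 indexing threads that works out to roughly 1.6 MB per document; routinely indexing docs larger than that could trigger the stalls described above.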

I'll work out a fix.  I think we should fix RAMFile.reset to trim its
buffers using ArrayUtil.getShrinkSize.
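A hedged sketch of what such a fix might look like (not the actual patch; getShrinkSize below is a simplified stand-in for Lucene's ArrayUtil.getShrinkSize, and the class is a toy model, not RAMFile itself):

```java
import java.util.ArrayList;
import java.util.List;

// Toy RAMFile whose reset() trims buffers instead of keeping them all.
class TrimmingRAMFile {
    static final int BUFFER_SIZE = 1024;
    private final List<byte[]> buffers = new ArrayList<>();
    private long length;

    // Simplified stand-in for ArrayUtil.getShrinkSize: only shrink when
    // the target is well below the current allocation, to avoid
    // thrashing on small fluctuations.
    static int getShrinkSize(int currentSize, int targetSize) {
        return targetSize < currentSize / 2 ? targetSize : currentSize;
    }

    void writeBytes(int numBytes) {
        length += numBytes;
        while (buffers.size() * BUFFER_SIZE < length) {
            buffers.add(new byte[BUFFER_SIZE]);
        }
    }

    void reset() {
        // Proposed behavior: drop buffers down to the shrink size so a
        // past huge doc no longer pins RAM across resets.
        int newSize = getShrinkSize(buffers.size(), 0);
        while (buffers.size() > newSize) {
            buffers.remove(buffers.size() - 1);
        }
        length = 0;
    }

    long sizeInBytes() {
        return (long) buffers.size() * BUFFER_SIZE;
    }
}
```

With this behavior, a PerDoc that once serviced a huge doc gives its buffers back on reset, so the accounted RAM drops and DW's concurrency is no longer "pinched" by the high-water mark.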


> Possible Memory Leak in StoredFieldsWriter
> ------------------------------------------
>
>                 Key: LUCENE-2283
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2283
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Tim Smith
>            Assignee: Michael McCandless
>
> StoredFieldsWriter creates a pool of PerDoc instances.
> This pool will grow but is never reclaimed by any mechanism.
> Furthermore, each PerDoc instance contains a RAMFile.
> This RAMFile will also never be truncated, and will only ever grow (as far 
> as I can tell).
> When feeding documents with a large number of stored fields (or one large 
> dominating stored field), this can result in memory being consumed in the 
> RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
> large, even if large documents are rare.
> Seems like there should be some attempt to reclaim memory from the PerDoc[] 
> instance pool (or to otherwise limit the size of cached RAMFiles), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
