[ https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711955#comment-16711955 ]
Ben Manes commented on PDFBOX-4396:
-----------------------------------

The process completed for one of the large uploads, and I had to disable the others because they were taking too long (hours). The CPU overhead on the machine caused bad user-facing latencies, since the scheduler doesn't take CPU into account and those jobs were being delayed.

Since our use cases have expanded from the expected 5-10 page documents to many thousands of pages (monthly historicals), I think it's no longer a good fit to do the work in a single process that is shared with other user-facing work. I think my next step should be to migrate this use case to a lambda, distribute page ranges, and invoke in parallel. That could easily be distributed using PDFBox and work great, but it's probably easier / faster / cheaper to use Ghostscript for such a simple lambda task.

The documents are not encrypted, so I think that case may not apply.

In my code I often pass around a Guava Closer to accumulate resources across methods, and then ensure that all of them are closed if they have not been closed otherwise. If everything is associated with a document, it would make sense for a closer to be propagated from the document, which can then close all of the resources (if not closed already). That could of course be a custom utility rather than Guava's.

You might also consider using weak / phantom references instead of finalization. For my application's file I/O (local and S3), I give clients a session with their own tempdir and reference count downloaded files against a global cache. The session handles are proxies that clients should close, but they are held in a weak-keyed cache where the actual implementation is the value. When the proxy is collected, the strongly referenced value is explicitly closed. This acts as a safety net just in case, since we do a lot of I/O and this form of reference caching is cheap. The same can be done better with phantom references, but that is more work than spinning up a weak cache with a removal listener.
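For illustration, the phantom-reference variant of such a safety net could look roughly like the following. This is a minimal sketch, assuming Java 9+ and `java.lang.ref.Cleaner`; the `Session` and `State` names are hypothetical and are not taken from the application described above or from PDFBox.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical session handle: clients should close it, but a Cleaner
// releases the underlying state if the handle is garbage collected first.
final class Session implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Holds the real resources. It must not reference the Session itself,
    // otherwise the Session could never become phantom reachable.
    static final class State implements Runnable {
        final AtomicInteger releases = new AtomicInteger();

        @Override
        public void run() {
            // Placeholder for real cleanup (delete temp files,
            // decrement reference counts, and so on).
            releases.incrementAndGet();
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    State state() {
        return state;
    }

    @Override
    public void close() {
        // Runs State.run() at most once, whether triggered here or by the GC.
        cleanable.clean();
    }
}
```

Explicit `close()` stays the primary path and is deterministic; the cleaner only covers handles that were dropped without being closed, and when it runs depends on the garbage collector.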
From reading the code, it looks like a lot of effort was made to close resources, but it also got really complex with patches for the inevitable leaks. Of course, you might not be able to change much due to API compatibility needs.

I think at this point I'll close this, like the other, as not something trivially fixable. I do think better resource handling is warranted, but that requires a thoughtful refactor.

> Memory leak due to soft reference caching
> -----------------------------------------
>
> Key: PDFBOX-4396
> URL: https://issues.apache.org/jira/browse/PDFBOX-4396
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.12
> Environment: JDK10; G1
> Reporter: Ben Manes
> Priority: Major
> Attachments: #2 - memory leak 2.png, #2 - memory leak.png, memory leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of
> memory due to buffered images (via PDImageXObject). I suspect that G1 is not
> collecting soft references across all regions before it out-of-memory errors.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10
> I/O bug. Previously I was loading the document to render each page, but this
> took 1.5 minutes. To work around that bug I reused the document instance
> across pages. This seems to have failed because the pages were cached and not
> cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft
> references are collected. Like WeakHashMap, it should use a ReferenceQueue,
> poll it on every access, and prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset
> the cache to a new instance after a page has been rendered. The entries
> should no longer be reachable and be GC'd more aggressively. If that doesn't
> work, I'll either replace the cache (e.g. with Caffeine) or disable it by
> setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally,
> reconsider usage of soft references (as they tend to be poor in practice).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
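The ReferenceQueue-based pruning suggested in the issue description can be sketched as follows. This is only an illustration of the general technique that WeakHashMap uses, applied to soft values; `PruningSoftCache` is a hypothetical class and is not PDFBox's DefaultResourceCache or a proposed patch for it.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical soft-value cache that removes map entries once the GC has
// cleared their referents, instead of leaving empty references behind.
final class PruningSoftCache<K, V> {

    // Soft reference that remembers its key so the stale entry can be found.
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public synchronized V get(K key) {
        prune();
        Entry<K, V> entry = map.get(key);
        return (entry == null) ? null : entry.get();
    }

    public synchronized void put(K key, V value) {
        prune();
        map.put(key, new Entry<>(key, value, queue));
    }

    public synchronized int size() {
        prune();
        return map.size();
    }

    // Drain references the GC has enqueued and drop their map entries.
    @SuppressWarnings("unchecked")
    private void prune() {
        for (Object ref = queue.poll(); ref != null; ref = queue.poll()) {
            Entry<K, V> stale = (Entry<K, V>) ref;
            // Remove only if the key still maps to this exact reference,
            // so a newer value stored under the same key is kept.
            map.remove(stale.key, stale);
        }
    }
}
```

Pruning on every access keeps the map from accumulating cleared entries, but it does not change when the collector decides to clear soft references, which is the other concern raised in the description.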