[ 
https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710391#comment-16710391
 ] 

Ben Manes commented on PDFBOX-4396:
-----------------------------------

Yes, I agree caching makes sense in general. My case is extreme due to N 
thousand page documents from scanned paperwork, taking 3-13 seconds per page 
for PDFBox to render into an image. While I'd appreciate better performance, 
that's only if it retains stability.

I agree weak references are not a fit here and did not intend to imply 
otherwise. My point is that the cache held 430k HashMap.Entry objects where 
many might have null values. This can be pruned by using a ReferenceQueue, 
something like the below code.

Soft references are problematic and typically chosen because a developer 
doesn't know a good size. Instead of a strict limit, the decision is left to 
the JVM. Soft references are effectively one JVM-wide cache, so an inexpensive 
cache might cause a critical one to be flushed. The collection behavior is GC 
specific and 
the penalty is placed in the critical section of the pause time. Many 
collectors are not aggressive, which increases hit rates but the memory 
pressure causes full GCs in short intervals. A collector that is aggressive 
makes the cache ineffective.

If there is a way to estimate the size, then a bounded cache is preferable. 
This avoids the above problems and can offer higher hit rates, since LRU is 
easily polluted. See for example [Caffeine's hit 
rates|https://github.com/ben-manes/caffeine/wiki/Efficiency], which take 
frequency into account, or our new [research 
paper|https://drive.google.com/file/d/1CT2ASkfuG9qVya9Sn8ZUCZjrFSSyjRA_/view?usp=sharing] 
for an adaptive policy. If the number of entries or the weight of an entry can 
be estimated, then a strong reference cache is typically the preferred 
approach. If that is problematic, usually one has to investigate off-heap 
caching.
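
As a rough illustration of a bounded, strong-reference cache, here is a minimal sketch using only the JDK. The class name BoundedCache and its methods are hypothetical and not part of PDFBox. It evicts in plain LRU order for brevity, so it has the pollution weakness mentioned above; a frequency-aware library such as Caffeine would replace the LinkedHashMap in practice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical strong-reference cache holding at most maximumSize entries. */
final class BoundedCache<K, V> {
  private final Map<K, V> cache;

  BoundedCache(int maximumSize) {
    // accessOrder=true keeps iteration order from least to most recently used
    this.cache = new LinkedHashMap<>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; evicts the least recently used entry
        return size() > maximumSize;
      }
    };
  }

  // Synchronized because even get() reorders an access-ordered LinkedHashMap
  synchronized V get(K key) {
    return cache.get(key);
  }

  synchronized void put(K key, V value) {
    cache.put(key, value);
  }

  synchronized int size() {
    return cache.size();
  }
}
```

The memory footprint is then bounded by the entry count rather than left to the collector, at the cost of having to pick that count up front.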

So far resetting the ResourceCache has been effective. I could try amortizing 
that, e.g. resetting it every N pages, to gain slightly better reuse as you 
indicated. If I had a better sense of the objects being cached, I would switch 
to a Caffeine-backed version for an explicit bound. Can the ResourceCache be 
shared across documents or are the entries document specific?
{code:java}
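import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.Map;

final ReferenceQueue<V> queue;
final Map<K, SoftValueReference<K, V>> cache;

public void put(K key, V value) {
  prune();
  cache.put(key, new SoftValueReference<>(key, value, queue));
}
public V get(K key) {
  prune();
  var ref = cache.get(key);
  return (ref == null) ? null : ref.get();
}
// Removes the map entries whose values have been collected
private void prune() {
  Reference<? extends V> ref;
  while ((ref = queue.poll()) != null) {
    @SuppressWarnings("unchecked")
    SoftValueReference<K, V> reference = (SoftValueReference<K, V>) ref;
    // Conditional remove so a newer mapping for the same key is kept
    cache.remove(reference.getKey(), reference);
  }
}

// Soft reference that remembers its key so the entry can be found when pruning
static final class SoftValueReference<K, V> extends SoftReference<V> {
  private final K key;

  public SoftValueReference(K key, V value, ReferenceQueue<V> queue) {
    super(value, queue);
    this.key = key;
  }
  public K getKey() {
    return key;
  }
}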
{code}
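
The amortized reset mentioned above (every N pages rather than after each page) could be sketched roughly as follows. PeriodicReset is a hypothetical helper, not a PDFBox class. The wiring to PDDocument#setResourceCache appears only in a comment, and the interval of 50 is an arbitrary placeholder rather than a measured value.

```java
/** Hypothetical helper that runs a reset action once every N rendered pages. */
final class PeriodicReset {
  private final int interval;
  private final Runnable reset;
  private int sinceReset;

  PeriodicReset(int interval, Runnable reset) {
    if (interval <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    this.interval = interval;
    this.reset = reset;
  }

  /** Call after each page has been rendered; not thread-safe. */
  void pageRendered() {
    if (++sinceReset >= interval) {
      sinceReset = 0;
      reset.run();
    }
  }
}

// Assumed wiring, with the interval chosen arbitrarily:
//   var resetter = new PeriodicReset(50,
//       () -> document.setResourceCache(new DefaultResourceCache()));
//   ... render a page, then call resetter.pageRendered();
```

A larger interval allows more reuse between resets but lets more cached images accumulate before they become unreachable.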

> Memory leak due to soft reference caching
> -----------------------------------------
>
>                 Key: PDFBOX-4396
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4396
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: JDK10; G1
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: memory leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of 
> memory due to buffered images (via PDImageXObject). I suspect that G1 is not 
> collecting soft references across all regions before it throws an 
> out-of-memory error.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10 
> I/O bug. Previously I was loading the document to render each page, but this 
> took 1.5 minutes. To work around that bug I reused the document instance 
> across pages. This seems to have failed because the pages were cached and 
> not cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft 
> references are collected. Like WeakHashMap, it should use a ReferenceQueue, 
> poll it on every access, and prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset 
> the cache to a new instance after a page has been rendered. The entries 
> should no longer be reachable and be GC'd more aggressively. If that doesn't 
> work, I'll either replace the cache (e.g. with Caffeine) or disable it by 
> setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally, 
> reconsider usage of soft references (as they tend to be poor in practice). 


