Kiyotsuki Suzuki created PDFBOX-6175:
----------------------------------------

             Summary: OutOfMemoryError parsing large CID fonts: soft-reference 
font cache cleared + W2 range expansion leads to OOM
                 Key: PDFBOX-6175
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6175
             Project: PDFBox
          Issue Type: Bug
          Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
    Affects Versions: 3.0.4 JBIG2, 3.0.7 PDFBox, 3.0.6 PDFBox, 3.0.5 PDFBox
         Environment: openjdk 21.0.7 2025-04-15
OpenJDK Runtime Environment Homebrew (build 21.0.7)
OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)

pdfbox version:3.0.4
            Reporter: Kiyotsuki Suzuki
             Fix For: 3.0.4 JBIG2
         Attachments: 6.pdf

### Description
When processing PDFs that contain large CID fonts (many CIDs and/or wide W2 
ranges), PDFBox can run into java.lang.OutOfMemoryError during font parsing / 
text extraction even with modest heap sizes. 

### Observed symptom
- OOM occurs while creating PDFont / PDCIDFont instances during text extraction.
- Problem appears when fonts are embedded as non-indirect objects and when W2 
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).

### Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
clears soft references, causing cached font objects to be discarded. Subsequent 
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle 
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID 
HashMap entries (boxed Integer/Float and Vector objects). A single large range 
(e.g. 0..16000) causes creation of tens of thousands of objects and large 
HashMap memory overhead, causing immediate heap exhaustion.

### Suggested fixes (implementation-level guidance)
- Avoid relying on SoftReference for the per-resource direct font cache for 
non-indirect embedded fonts. Use strong references scoped to PDResources (or 
make the behavior configurable). PDResources is freed with the document 
lifecycle, so strong references prevent repeated re-parsing without leaking 
across documents.
- Do not expand large W2 ranges into individual boxed map entries. Parse and 
store W2 ranges compactly (e.g. range list with primitive arrays or small 
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
use a compact index). This avoids creating thousands of Integer/Float/Vector 
objects for wide ranges.
- Add tests exercising large CID fonts with wide W2 ranges to guard against 
regressions, and add a memory-use test if possible.
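
The "compact index" option mentioned above could look roughly like the following. This is a minimal sketch, not PDFBox code: the class name `SortedW2Ranges` and its methods are hypothetical, and it assumes the ranges have already been sorted by `first` and do not overlap.

```java
import java.util.Arrays;

/** Hypothetical compact index over W2 ranges; not part of PDFBox. */
final class SortedW2Ranges
{
    private final int[] first;  // range starts, ascending
    private final int[] last;   // inclusive range ends
    private final float[] w1y;  // vertical displacement per range

    SortedW2Ranges(int[] first, int[] last, float[] w1y)
    {
        // Assumption: ranges are sorted by 'first' and do not overlap.
        this.first = first.clone();
        this.last = last.clone();
        this.w1y = w1y.clone();
    }

    /** Returns the w1y of the range containing cid, or fallback if none does. */
    float lookupW1y(int cid, float fallback)
    {
        int pos = Arrays.binarySearch(first, cid);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the only
        // candidate range is the one just before the insertion point.
        int idx = pos >= 0 ? pos : -pos - 2;
        if (idx >= 0 && cid <= last[idx])
        {
            return w1y[idx];
        }
        return fallback;
    }
}
```

Lookup cost is logarithmic in the number of ranges, and storage stays at a few primitive slots per range regardless of how many CIDs a range spans.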

### Why the fix should be upstream
- This is a parser/runtime efficiency bug that affects robustness for 
real-world PDFs (CJK/CID fonts). An upstream fix avoids repeated re-parsing and 
large allocations for all users.

---

### Example

### [/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java](https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java)
```diff
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
         // we don't actually read the complete table here because it can contain tens of thousands of glyphs
         // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
         byte[] dataBytes = data.read((int) getLength());
+        Runtime rt = Runtime.getRuntime();
+        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  free=%6.1fMB  used=%6.1fMB%n",
+            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
         {
             this.data = new RandomAccessReadDataStream(read);
```

This temporary debug statement prints the read size and the JVM memory state 
(free/used) immediately after the byte array of the GlyphTable is loaded, to 
show how much memory was consumed and whether the heap is becoming 
constrained.

In this investigation, it helped confirm that “free heap shrinks with each 
load and then recovers after GC,” which was useful for tracing the cause of the 
OutOfMemoryError (OOM).
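
The same figures can be collected through a small helper instead of repeating the arithmetic at each call site. The sketch below is illustrative only; `HeapLog` and `snapshot` are hypothetical names. Note that `Runtime.freeMemory()` is relative to the currently committed heap (`totalMemory()`), so the sketch also prints `maxMemory()` to show how far the heap can still grow.

```java
import java.util.Locale;

/** Hypothetical helper; formats the same figures the debug statement prints. */
final class HeapLog
{
    private static final double MB = 1024.0 * 1024.0;

    /** Returns one line describing the current heap state, tagged with label. */
    static String snapshot(String label)
    {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();            // free part of the committed heap
        long used = rt.totalMemory() - free;    // committed heap currently in use
        return String.format(Locale.ROOT,
                "[%s] free=%.1fMB used=%.1fMB max=%.1fMB",
                label, free / MB, used / MB, rt.maxMemory() / MB);
    }
}
```

A call such as `System.out.println(HeapLog.snapshot("GlyphTable"))` placed after the `data.read(...)` line would produce output comparable to the statement in the diff.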


---

### [/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java)

```diff
@@ -17,7 +17,6 @@
 package org.apache.pdfbox.pdmodel;
 
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 
 /**
  * A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
     
     // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
     // and this would result in huge memory footprint in text extraction
-    private final Map<COSName, SoftReference<PDFont>> directFontCache;
+    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
+    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
+    private final Map<COSName, PDFont> directFontCache;
 
     /**
      * Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
      * @param directFontCache The document's direct font cache. Must be mutable
      */
     public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
-            Map<COSName, SoftReference<PDFont>> directFontCache)
+            Map<COSName, PDFont> directFontCache)
     {
         if (resourceDictionary == null)
         {
@@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            SoftReference<PDFont> ref = directFontCache.get(name);
-            if (ref != null)
+            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
+            PDFont cached = directFontCache.get(name);
+            if (cached != null)
             {
-                PDFont cached = ref.get();
-                if (cached != null)
-                {
-                    return cached;
-                }
+                return cached;
             }
         }
 
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            directFontCache.put(name, new SoftReference<>(font));
+            directFontCache.put(name, font);
         }
         return font;
     }
```
## What Was Changed (Key Code Changes)

* The `PDResources` field was changed from:

  * **Before:**
    `Map<COSName, SoftReference<PDFont>> directFontCache`

  * **After:**
    `Map<COSName, PDFont> directFontCache`

* The constructor parameter type was also updated accordingly:
  `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.

* In `getFont(...)`, the retrieval logic was simplified:

  * **Before:**
    Retrieved a `SoftReference` and then called `ref.get()` to obtain the 
`PDFont`.

  * **After:**
    The `PDFont` is retrieved directly, so `ref.get()` is no longer needed.

* The caching logic was also updated:

  * **Before:**
    `directFontCache.put(name, new SoftReference<>(font))`

  * **After:**
    `directFontCache.put(name, font)`

* Additionally, a debug message was added:

  ```java
  System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
  ```

---

## Original Problem (Why the Fix Was Necessary)

The original implementation stored **fonts embedded directly as dictionaries** 
using `SoftReference`.

`SoftReference` allows the JVM to **automatically clear cached objects when 
memory becomes low**.

This created a problem:

1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, the system **re-parses the font 
from the PDF**.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:

```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```

This **reparse loop** can eventually cause an **OutOfMemoryError (OOM)**.

The issue becomes particularly severe with **CID fonts (large CJK fonts)**, 
because parsing them creates very large in-memory structures.
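
The loop described above can be reproduced deterministically in isolation. The sketch below is not PDFBox code: `ReparseDemo` and its members are hypothetical, a small `byte[]` stands in for a parsed font, and memory pressure is simulated by calling `clear()` on each reference rather than by relying on the garbage collector.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ReparseDemo
{
    /** Cache holding values through SoftReference, like the original directFontCache. */
    static final class SoftCache<K, V>
    {
        private final Map<K, SoftReference<V>> map = new HashMap<>();
        int loads; // number of times the expensive loader ran

        V get(K key, Function<K, V> loader)
        {
            SoftReference<V> ref = map.get(key);
            V value = ref == null ? null : ref.get();
            if (value == null)
            {
                value = loader.apply(key); // stands in for font parsing
                loads++;
                map.put(key, new SoftReference<>(value));
            }
            return value;
        }

        /** Simulates the GC clearing every soft reference under memory pressure. */
        void simulateMemoryPressure()
        {
            for (SoftReference<V> ref : map.values())
            {
                ref.clear();
            }
        }
    }

    /** Soft cache: every simulated pressure event forces another load. */
    static int softLoads(int rounds)
    {
        SoftCache<String, byte[]> cache = new SoftCache<>();
        for (int i = 0; i < rounds; i++)
        {
            cache.get("F1", k -> new byte[1024]);
            cache.simulateMemoryPressure();
        }
        return cache.loads;
    }

    /** Strong cache: the value is loaded once and then reused. */
    static int strongLoads(int rounds)
    {
        Map<String, byte[]> cache = new HashMap<>();
        int[] loads = {0};
        for (int i = 0; i < rounds; i++)
        {
            cache.computeIfAbsent("F1", k -> { loads[0]++; return new byte[1024]; });
        }
        return loads[0];
    }
}
```

With three rounds, the soft cache runs the loader three times and the strong cache once, which mirrors the difference between re-parsing a font after every clearing and parsing it a single time.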

---

## What This Fix Improves

By switching `directFontCache` to **strong references (`PDFont`)**, the JVM can 
no longer clear the cached fonts automatically.

This prevents the cycle:

```
memory pressure
→ cache cleared
→ font re-parsed
→ more memory pressure
```

As a result, the system **stops repeatedly parsing the same fonts**, preventing 
unnecessary heap consumption and avoiding the OOM scenario.

---


### [pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java)
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
 
 import java.io.IOException;
 import java.io.InputStream;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
-
 import org.apache.pdfbox.cos.COSArray;
 import org.apache.pdfbox.cos.COSBase;
 import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     private float defaultWidth;
     private float averageWidth;
 
-    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
-    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v
+    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
+    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v   (individual entries)
+    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
+    // A single range entry (first..last) replaces thousands of individual HashMap entries.
+    private int[] vdRangeFirst = new int[0];
+    private int[] vdRangeLast  = new int[0];
+    private float[] vdRangeW1y = new float[0];
+    private float[] vdRangeVx  = new float[0];
+    private float[] vdRangeVy  = new float[0];
     private final float[] dw2 = new float[] { 880, -1000 };
 
     protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     {
         this.dict = fontDictionary;
         this.parent = parent;
+        Runtime rt = Runtime.getRuntime();
+        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
+        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  used=%6.1fMB%n",
+            fontName,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         readWidths();
+        System.out.printf("[PDCIDFont] after readWidths  %-32s  widths.size=%d  free=%6.1fMB%n",
+            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
         readVerticalDisplacements();
+        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  ranges=%d  free=%6.1fMB%n",
+            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
+            rt.freeMemory() / 1024.0 / 1024.0);
     }
 
     private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
                     COSNumber w1y = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1x = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1y = (COSNumber) w2Array.getObject(++i);
-                    for (int cid = first; cid <= last; cid++)
-                    {
-                        verticalDisplacementY.put(cid, w1y.floatValue());
-                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
-                    }
+                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
+                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
+                    // when a single range covers many CIDs (e.g. 0..16000).
+                    int n = vdRangeFirst.length;
+                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
+                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
+                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
+                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
+                    vdRangeFirst[n] = first;
+                    vdRangeLast[n]  = last;
+                    vdRangeW1y[n]   = w1y.floatValue();
+                    vdRangeVx[n]    = v1x.floatValue();
+                    vdRangeVy[n]    = v1y.floatValue();
                 }
             }
         }
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public Vector getPositionVector(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Vector v = positionVectors.get(cid);
-        if (v == null)
+        if (v != null)
+        {
+            return v;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
         {
-            v = getDefaultPositionVector(cid);
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return new Vector(vdRangeVx[i], vdRangeVy[i]);
+            }
         }
-        return v;
+        return getDefaultPositionVector(cid);
     }
 
     /**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public float getVerticalDisplacementVectorY(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Float w1y = verticalDisplacementY.get(cid);
-        if (w1y == null)
+        if (w1y != null)
         {
-            w1y = dw2[1];
+            return w1y;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
+        {
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return vdRangeW1y[i];
+            }
         }
-        return w1y;
+        return dw2[1];
     }
 
     @Override

```

To reduce memory consumption, I stopped expanding **large CID ranges in W2 
(vertical metrics)** one by one into massive numbers of objects. Instead, the 
range information is now stored in **small primitive arrays**. This avoids 
creating large numbers of `Integer`, `Float`, and `Vector` objects and prevents 
**OutOfMemoryError (OOM)**.

## Changes

* Added `Arrays` to the imports (used for array expansion).

* Added new fields:

  * `vdRangeFirst` / `vdRangeLast` (`int[]`)
  * `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
    These arrays store the values for each **range entry (first..last)**.

* Added memory status logging in the constructor:

  * Logs free/used memory **before `readWidths()`**, then free memory **after 
`readWidths()`** and **after `readVerticalDisplacements()`** (labelled `readVD` 
in the output).

* Changes in `readVerticalDisplacements()`:

  * The existing **array format (individual entries)** is handled as before and 
stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
  * When a **range format** entry (`first last w1y v1x v1y`) is encountered, 
instead of looping from `first..last` and inserting each value into the 
`HashMap`, the code now:

    * Expands the `vdRange*` arrays using `Arrays.copyOf`
    * Stores the **range as a single entry** in those arrays.
  * Purpose: prevent generating **tens of thousands of objects** for large 
ranges (e.g., 16,000 entries).

* Changes in the lookup logic (`getPositionVector` / 
`getVerticalDisplacementVectorY`):

  1. First check **individual entries** (`positionVectors` / 
`verticalDisplacementY`).

     * If found, return that value.
  2. Otherwise, **linearly search** the `vdRange*` arrays to see if the CID 
falls within a stored range.

     * If a matching range is found, return the corresponding value.
  3. If neither matches, return the **default value**.
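
The three lookup steps above can be sketched in isolation as follows. This is a simplified stand-in rather than the patched `PDCIDFont`: the class `VerticalMetricsSketch` and its method names are hypothetical, and only the `w1y` value is modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the W2 storage and lookup described above. */
final class VerticalMetricsSketch
{
    private final Map<Integer, Float> individual = new HashMap<>();
    private int[] rangeFirst = new int[0];
    private int[] rangeLast = new int[0];
    private float[] rangeW1y = new float[0];
    private final float defaultW1y;

    VerticalMetricsSketch(float defaultW1y)
    {
        this.defaultW1y = defaultW1y;
    }

    /** Array-format entry: one value for one CID. */
    void putIndividual(int cid, float w1y)
    {
        individual.put(cid, w1y);
    }

    /** Range-format entry: one slot per array, however wide the range is. */
    void addRange(int first, int last, float w1y)
    {
        int n = rangeFirst.length;
        rangeFirst = Arrays.copyOf(rangeFirst, n + 1);
        rangeLast = Arrays.copyOf(rangeLast, n + 1);
        rangeW1y = Arrays.copyOf(rangeW1y, n + 1);
        rangeFirst[n] = first;
        rangeLast[n] = last;
        rangeW1y[n] = w1y;
    }

    float w1y(int cid)
    {
        Float v = individual.get(cid);           // step 1: individual entries
        if (v != null)
        {
            return v;
        }
        for (int i = 0; i < rangeFirst.length; i++) // step 2: linear range scan
        {
            if (cid >= rangeFirst[i] && cid <= rangeLast[i])
            {
                return rangeW1y[i];
            }
        }
        return defaultW1y;                       // step 3: default
    }
}
```

One consequence of this order is that an individual entry always wins over a range, and the first matching range wins over later ones. The per-CID expansion in the original code let a later W2 entry overwrite an earlier one, so the two approaches may differ for fonts whose W2 entries overlap.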

## Why this works (briefly)

* **Before:**
  Receiving a range like `0..16000` caused the code to insert **16,001 entries 
into each of two `HashMap`s**, allocating boxed `Integer` keys, `Float` values, 
and `Vector` objects for every CID, which led to large memory overhead.

* **Now:**
  The same range is stored as **a single range entry** of five primitive array 
slots (two `int`, three `float`), which is about **20 bytes of payload per 
range**.
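
As a rough back-of-the-envelope check of the comparison above, the sketch below counts allocations rather than measuring them. The class `W2Footprint` is hypothetical. The per-CID object count is an upper bound that assumes two map nodes, two boxed `Integer` keys, one `Float` and one `Vector` per CID, and it ignores the `HashMap` table arrays and the small-value `Integer` cache.

```java
/** Hypothetical arithmetic helper; counts objects, does not measure heap. */
final class W2Footprint
{
    // Assumed per-CID allocations in the expanded form (upper bound):
    // 2 HashMap nodes + 2 boxed Integer keys + 1 Float + 1 Vector.
    private static final int OBJECTS_PER_CID = 6;

    /** Upper bound on objects created when first..last is expanded per CID. */
    static long expandedObjectsUpperBound(int first, int last)
    {
        long cids = (long) last - first + 1;
        return cids * OBJECTS_PER_CID;
    }

    /** Payload bytes used by the compact form: 2 int + 3 float slots per range. */
    static int compactPayloadBytes(int ranges)
    {
        return ranges * (2 * Integer.BYTES + 3 * Float.BYTES);
    }
}
```

For the range `0..16000`, `expandedObjectsUpperBound(0, 16000)` gives 96,006 objects, while `compactPayloadBytes(1)` gives 20 bytes.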

---

### [pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java)
```diff
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
index 3b970edd24..5b574d8c66 100644
--- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
+++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
@@ -18,9 +18,7 @@ package org.apache.pdfbox.pdmodel.interactive.form;
 
 import java.awt.geom.GeneralPath;
 import java.awt.geom.Rectangle2D;
-
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.HashMap;
@@ -75,7 +73,7 @@ public final class PDAcroForm implements COSObjectable
 
     private ScriptingHandler scriptingHandler;
 
-    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
+    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
 
     /**
      * Constructor.

```
* **What was changed**

  * The type of `directFontCache` was changed:

    * **Before:** `Map<COSName, SoftReference<PDFont>>`
    * **After:** `Map<COSName, PDFont>`
  * As a result, the `SoftReference` import was removed. The `get`/`put` logic 
that stores and retrieves `PDFont` directly instead of going through 
`SoftReference` is in `PDResources` (see the diff above), which accepts such a 
map through its constructor.

* **Why this was changed (problem description)**

  * Previously, `SoftReference` was used so that the JVM could automatically 
clear the referenced `PDFont` objects when memory became tight.
  * However, when the reference was cleared, the same font would be parsed 
again the next time it was requested. Since font parsing is expensive, this 
could happen repeatedly, causing excessive memory allocations.
  * In some cases this repeated parsing led to unnecessary memory pressure and 
eventually an `OutOfMemoryError`.
  * By switching to strong references, the cache is no longer cleared 
unpredictably by the GC, preventing this repeated parse cycle.

* **Behavior after the change (effect)**

  * The same embedded font will no longer be parsed multiple times within the 
same page or the same `PDResources`.
  * This reduces unnecessary memory allocations and helps prevent 
`OutOfMemoryError`.
  * Fonts are expected to be released according to the lifecycle of 
`PDResources` / `PDDocument`, typically when page processing is finished.


