[ 
https://issues.apache.org/jira/browse/PDFBOX-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kiyotsuki Suzuki updated PDFBOX-6175:
-------------------------------------
    Description: 
When processing PDFs that contain large CID fonts (many CIDs and/or wide W2 
ranges), PDFBox can run into java.lang.OutOfMemoryError during font parsing / 
text extraction even with modest heap sizes.

*<Observed symptom>*
 - OOM occurs while creating PDFont / PDCIDFont instances during text 
extraction.
 - Problem appears when fonts are embedded as non-indirect objects and when W2 
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).

 * Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
clears soft references, causing cached font objects to be discarded. Subsequent 
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle 
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID 
HashMap entries (boxed Integer/Float and Vector objects). A single large range 
(e.g. 0..16000) causes creation of tens of thousands of objects and large 
HashMap memory overhead, causing immediate heap exhaustion.
 * Suggested fixes (implementation-level guidance)
 -- Avoid relying on SoftReference for the per-resource direct font cache for 
non-indirect embedded fonts. Use strong references scoped to PDResources (or 
make the behavior configurable). PDResources is freed with the document 
lifecycle, so strong references prevent repeated re-parsing without leaking 
across documents.
 -- Do not expand large W2 ranges into individual boxed map entries. Parse and 
store W2 ranges compactly (e.g. range list with primitive arrays or small 
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
use a compact index). This avoids creating thousands of Integer/Float/Vector 
objects for wide ranges.
 -- Add tests exercising large CID fonts with wide W2 ranges to guard against 
regressions, and add a memory-use test if possible.

 * Why fix should be upstream
 ** This is a parser/runtime efficiency bug that affects robustness for 
real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and 
large allocations across all users.
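
The second suggested fix (store W2 ranges compactly and check ranges at lookup 
time) could be sketched roughly as follows. This is an illustrative, 
self-contained sketch and not PDFBox code: the class W2RangeTable and all of 
its method names are hypothetical. It assumes that a later W2 entry should win 
when ranges overlap, mirroring the overwrite behaviour of the per-CID 
HashMap.put calls it would replace.

```java
import java.util.Arrays;

/** Hypothetical compact store for W2 range entries; not part of PDFBox. */
final class W2RangeTable
{
    private int[] first = new int[4];
    private int[] last = new int[4];
    private float[] w1y = new float[4];
    private float[] vx = new float[4];
    private float[] vy = new float[4];
    private int size;
    // Stays true while every added range starts after the previous one ends,
    // which is the precondition for the binary search in indexOf().
    private boolean ascending = true;

    void add(int firstCid, int lastCid, float w1yValue, float vxValue, float vyValue)
    {
        if (lastCid < firstCid)
        {
            return; // empty range: a per-CID loop would add nothing either
        }
        if (size == first.length)
        {
            int capacity = size * 2; // amortized growth, five small arrays
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            w1y = Arrays.copyOf(w1y, capacity);
            vx = Arrays.copyOf(vx, capacity);
            vy = Arrays.copyOf(vy, capacity);
        }
        if (size > 0 && firstCid <= last[size - 1])
        {
            ascending = false; // overlapping or unordered input
        }
        first[size] = firstCid;
        last[size] = lastCid;
        w1y[size] = w1yValue;
        vx[size] = vxValue;
        vy[size] = vyValue;
        size++;
    }

    /** Returns the index of the range containing cid, or -1 if none does. */
    int indexOf(int cid)
    {
        if (ascending)
        {
            int lo = 0;
            int hi = size - 1;
            while (lo <= hi)
            {
                int mid = (lo + hi) >>> 1;
                if (cid < first[mid])
                {
                    hi = mid - 1;
                }
                else if (cid > last[mid])
                {
                    lo = mid + 1;
                }
                else
                {
                    return mid;
                }
            }
            return -1;
        }
        // Fallback: scan newest first so that later entries win on overlap.
        for (int i = size - 1; i >= 0; i--)
        {
            if (cid >= first[i] && cid <= last[i])
            {
                return i;
            }
        }
        return -1;
    }

    float w1yAt(int index) { return w1y[index]; }
    float vxAt(int index) { return vx[index]; }
    float vyAt(int index) { return vy[index]; }
}
```

A caller would look up indexOf(cid) once and, if the result is not -1, read 
the three values at that index; otherwise it would fall back to the DW2 
defaults. Whether binary search is worth the extra flag depends on how many 
ranges real fonts carry; for a handful of ranges a plain linear scan is 
probably just as good.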

 

*<Example>*

*fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java* ([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
{code:java}
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
         // we don't actually read the complete table here because it can contain tens of thousands of glyphs
         // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
         byte[] dataBytes = data.read((int) getLength());
+        Runtime rt = Runtime.getRuntime();
+        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  free=%6.1fMB  used=%6.1fMB%n",
+            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
         {
             this.data = new RandomAccessReadDataStream(read);
{code}
This temporary log statement prints the number of bytes read and the JVM 
memory state (free/used) immediately after the byte array of the GlyphTable is 
loaded, to show how much memory each load consumes and whether the heap is 
becoming constrained. In this investigation it confirmed that the heap shrinks 
with each load and then recovers after GC, which helped trace the cause of the 
OutOfMemoryError (OOM).
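
The same kind of heap snapshot can be factored into a small helper so that the 
formatting lives in one place. A minimal sketch: HeapProbe and its method 
names are hypothetical and are not part of PDFBox or FontBox. Note that 
Runtime.freeMemory() is relative to totalMemory() (the heap currently 
committed by the JVM), not to the -Xmx limit, so "free" can grow again when 
the JVM expands the heap.

```java
import java.util.Locale;

/** Hypothetical helper for ad-hoc heap logging; not part of PDFBox. */
final class HeapProbe
{
    private static final double MB = 1024.0 * 1024.0;

    /** Formats a label plus free/used megabytes from raw byte counts. */
    static String format(String label, long totalBytes, long freeBytes)
    {
        return String.format(Locale.ROOT, "[%s] free=%6.1fMB  used=%6.1fMB",
                label, freeBytes / MB, (totalBytes - freeBytes) / MB);
    }

    /** Takes a snapshot of the current JVM heap state. */
    static String snapshot(String label)
    {
        Runtime rt = Runtime.getRuntime();
        return format(label, rt.totalMemory(), rt.freeMemory());
    }
}
```

A call such as System.out.println(HeapProbe.snapshot("GlyphTable")) placed 
after the read would then replace the inline printf.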

 

*pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java* ([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
{code:java}
@@ -17,7 +17,6 @@
 package org.apache.pdfbox.pdmodel;
 
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 
 /**
  * A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
     
     // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
     // and this would result in huge memory footprint in text extraction
-    private final Map<COSName, SoftReference<PDFont>> directFontCache;
+    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
+    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
+    private final Map<COSName, PDFont> directFontCache;
 
     /**
      * Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
      * @param directFontCache The document's direct font cache. Must be mutable
      */
     public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
-            Map<COSName, SoftReference<PDFont>> directFontCache)
+            Map<COSName, PDFont> directFontCache)
     {
         if (resourceDictionary == null)
         {
@@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            SoftReference<PDFont> ref = directFontCache.get(name);
-            if (ref != null)
+            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
+            PDFont cached = directFontCache.get(name);
+            if (cached != null)
             {
-                PDFont cached = ref.get();
-                if (cached != null)
-                {
-                    return cached;
-                }
+                return cached;
             }
         }
 
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            directFontCache.put(name, new SoftReference<>(font));
+            directFontCache.put(name, font);
         }
         return font;
     } {code}
 * What Was Changed (Key Code Changes)
 ** The `PDResources` field was changed from:
 *** Before: `Map<COSName, SoftReference<PDFont>> directFontCache`
 *** After: `Map<COSName, PDFont> directFontCache`
 ** The constructor parameter type was also updated accordingly:
 *** `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
 ** In `getFont(...)`, the retrieval logic was simplified:
 *** Before: Retrieved a `SoftReference` and then called `ref.get()` to obtain 
the `PDFont`.
 *** After: The `PDFont` is retrieved directly, so `ref.get()` is no longer 
needed.
 ** The caching logic was also updated:
 *** Before: `directFontCache.put(name, new SoftReference<>(font))`
 *** After: `directFontCache.put(name, font)`
 * Original Problem (Why the Fix Was Necessary)
 ** The original implementation cached *fonts embedded directly as 
dictionaries* (non-indirect objects) through `SoftReference`.

`SoftReference` allows the JVM to *clear cached objects automatically when 
memory runs low*.

This created a problem:

1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, PDFBox *re-parses the font from 
the PDF*.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:

```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```

This re-parse loop can eventually cause an OutOfMemoryError (OOM). The issue 
is particularly severe with *CID fonts (large CJK fonts)*, because parsing 
them creates very large in-memory structures.

 * What This Fix Improves
 ** By switching `directFontCache` to *strong references (`PDFont`)*, the JVM 
can no longer clear the cached fonts automatically.
 ** This breaks the cycle of memory pressure → cache cleared → font re-parsed 
→ more memory pressure.
 ** As a result, PDFBox *stops repeatedly parsing the same fonts*, which 
avoids the unnecessary heap consumption and the OOM scenario.
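
The difference between the two caching strategies can be illustrated with a 
small stand-alone model. This is only a sketch under stated assumptions: the 
class DirectFontCacheSketch, its methods, and the byte[] stand-in for a parsed 
font are all hypothetical, and memory pressure is simulated by clearing the 
soft references explicitly (a real GC decides this on its own). It is not the 
PDResources implementation.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a soft vs. strong font cache; not PDFBox code. */
final class DirectFontCacheSketch
{
    private final Map<String, SoftReference<byte[]>> softCache = new HashMap<>();
    private final Map<String, byte[]> strongCache = new HashMap<>();
    int parseCount;

    // Stand-in for an expensive font parse; counts how often it runs.
    private byte[] parseFont(String name)
    {
        parseCount++;
        return new byte[1024];
    }

    // Mirrors the "before" logic: a cleared reference forces a re-parse.
    byte[] getViaSoftCache(String name)
    {
        SoftReference<byte[]> ref = softCache.get(name);
        byte[] font = ref == null ? null : ref.get();
        if (font == null)
        {
            font = parseFont(name);
            softCache.put(name, new SoftReference<>(font));
        }
        return font;
    }

    // Mirrors the "after" logic: the entry lives as long as the map does.
    byte[] getViaStrongCache(String name)
    {
        return strongCache.computeIfAbsent(name, this::parseFont);
    }

    // Simulates what the GC may do to soft references when memory runs low.
    void simulateMemoryPressure()
    {
        for (SoftReference<byte[]> ref : softCache.values())
        {
            ref.clear();
        }
    }
}
```

In this model, a lookup after simulateMemoryPressure() parses the font again 
through the soft cache but not through the strong cache. The trade-off is that 
strongly cached fonts stay reachable until the owning map is released.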

 

*pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java* ([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
{code:java}
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
 
 import java.io.IOException;
 import java.io.InputStream;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
-
 import org.apache.pdfbox.cos.COSArray;
 import org.apache.pdfbox.cos.COSBase;
 import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     private float defaultWidth;
     private float averageWidth;
 
-    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
-    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v
+    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
+    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v   (individual entries)
+    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
+    // A single range entry (first..last) replaces thousands of individual HashMap entries.
+    private int[] vdRangeFirst = new int[0];
+    private int[] vdRangeLast  = new int[0];
+    private float[] vdRangeW1y = new float[0];
+    private float[] vdRangeVx  = new float[0];
+    private float[] vdRangeVy  = new float[0];
     private final float[] dw2 = new float[] { 880, -1000 };
 
     protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     {
         this.dict = fontDictionary;
         this.parent = parent;
+        Runtime rt = Runtime.getRuntime();
+        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
+        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  used=%6.1fMB%n",
+            fontName,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         readWidths();
+        System.out.printf("[PDCIDFont] after readWidths  %-32s  widths.size=%d  free=%6.1fMB%n",
+            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
         readVerticalDisplacements();
+        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  ranges=%d  free=%6.1fMB%n",
+            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
+            rt.freeMemory() / 1024.0 / 1024.0);
     }
 
     private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
                     COSNumber w1y = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1x = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1y = (COSNumber) w2Array.getObject(++i);
-                    for (int cid = first; cid <= last; cid++)
-                    {
-                        verticalDisplacementY.put(cid, w1y.floatValue());
-                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
-                    }
+                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
+                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
+                    // when a single range covers many CIDs (e.g. 0..16000).
+                    int n = vdRangeFirst.length;
+                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
+                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
+                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
+                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
+                    vdRangeFirst[n] = first;
+                    vdRangeLast[n]  = last;
+                    vdRangeW1y[n]   = w1y.floatValue();
+                    vdRangeVx[n]    = v1x.floatValue();
+                    vdRangeVy[n]    = v1y.floatValue();
                 }
             }
         }
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public Vector getPositionVector(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Vector v = positionVectors.get(cid);
-        if (v == null)
+        if (v != null)
+        {
+            return v;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
         {
-            v = getDefaultPositionVector(cid);
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return new Vector(vdRangeVx[i], vdRangeVy[i]);
+            }
         }
-        return v;
+        return getDefaultPositionVector(cid);
     }
 
     /**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public float getVerticalDisplacementVectorY(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Float w1y = verticalDisplacementY.get(cid);
-        if (w1y == null)
+        if (w1y != null)
         {
-            w1y = dw2[1];
+            return w1y;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
+        {
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return vdRangeW1y[i];
+            }
         }
-        return w1y;
+        return dw2[1];
     }
 
     @Override``` {code}
To reduce memory consumption, I stopped expanding *large CID ranges in W2 
(vertical metrics)* into one map entry per CID. Instead, each range is now 
stored once in *small primitive arrays*. This avoids creating large numbers of 
`Integer`, `Float`, and `Vector` objects and removes one of the allocation 
spikes behind the *OutOfMemoryError (OOM)*.
 * Changes
 ** Added `Arrays` to the imports (used to grow the arrays).
 ** Added new fields:
 *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
 *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
 **** These parallel arrays store the values for each *range entry 
(first..last)*; index i in every array describes the same range.
 ** Changes in `readVerticalDisplacements()`:
 *** The existing *array format (individual entries)* is handled as before 
and stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
 *** When a *range format* entry (`first last w1y v1x v1y`) is encountered, 
instead of looping over `first..last` and inserting each CID into the 
`HashMap`s, the code now:
 **** Grows each `vdRange*` array by one element using `Arrays.copyOf`
 **** Stores the *range as a single entry* in those arrays.
 *** Purpose: avoid generating *tens of thousands of objects* for large 
ranges (e.g., a range covering 16,001 CIDs).
 ** Changes in the lookup logic (`getPositionVector` / 
`getVerticalDisplacementVectorY`):
 *** First check *individual entries* (`positionVectors` / 
`verticalDisplacementY`).
 **** If found, return that value.
 *** Otherwise, *linearly search* the `vdRange*` arrays to see whether the CID 
falls within a stored range. The cost is proportional to the number of ranges, 
not to the number of CIDs.
 **** If a matching range is found, return the corresponding value (the first 
matching range wins).
 *** If neither matches, return the *default value*.
 ** Why this works (briefly)
 *** Before: A range like `0..16000` caused the code to insert 16,001 entries 
into each of the two `HashMap`s, allocating a `Float` and a `Vector` per CID 
plus boxed `Integer` keys and map nodes.
 *** Now: The same range is stored as *a single range entry* of two `int` and 
three `float` values, about 20 bytes of array data per range.
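
For a rough sense of scale, the before/after footprint can be estimated with 
simple arithmetic. The per-object sizes below are assumptions (typical figures 
for a 64-bit HotSpot JVM with compressed object pointers), not measurements 
from this investigation, and the class W2FootprintEstimate is purely 
illustrative. The estimate ignores the Integer cache for small values and the 
HashMap table arrays.

```java
/** Back-of-envelope estimate only; all object sizes are ASSUMED. */
final class W2FootprintEstimate
{
    // Assumed shallow sizes in bytes (64-bit HotSpot, compressed oops).
    static final long BOXED_INTEGER = 16;
    static final long BOXED_FLOAT = 16;
    static final long VECTOR = 24;    // object header plus two float fields
    static final long MAP_NODE = 32;  // one HashMap entry node

    /** Estimated bytes when a range is expanded into both per-CID maps. */
    static long expandedBytes(int first, int last)
    {
        long cids = (long) last - first + 1;
        // Each CID is put into two maps: two boxed keys and two nodes,
        // plus one Float value and one Vector value.
        long perCid = 2 * BOXED_INTEGER + 2 * MAP_NODE + BOXED_FLOAT + VECTOR;
        return cids * perCid;
    }

    /** Array payload when ranges are stored as two ints and three floats. */
    static long compactBytes(int ranges)
    {
        return ranges * (2L * Integer.BYTES + 3L * Float.BYTES);
    }
}
```

Under these assumptions, expandedBytes(0, 16000) comes to about 2.2 MB for a 
single font, while compactBytes(1) is 20 bytes. The absolute numbers will 
differ between JVMs; the point is the ratio, and that the expanded cost is paid 
again every time the font is re-parsed.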

 

> OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared + 
> W2 range expansion leads to OOM
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.5 PDFBox, 3.0.6 PDFBox, 3.0.7 PDFBox, 3.0.4 JBIG2
>         Environment: openjdk 21.0.7 2025-04-15
> OpenJDK Runtime Environment Homebrew (build 21.0.7)
> OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)
> pdfbox version:3.0.4
>            Reporter: Kiyotsuki Suzuki
>            Priority: Major
>              Labels: performance
>             Fix For: 3.0.4 JBIG2
>
>         Attachments: 6.pdf
>
>
> {color:#172b4d}When processing PDFs that contain large CID fonts (many CIDs 
> and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during 
> font parsing / text extraction even with modest heap sizes. {color}
> *<Observed symptom>*
>  - OOM occurs while creating PDFont / PDCIDFont instances during text 
> extraction.
>  - Problem appears when fonts are embedded as non-indirect objects and when 
> W2 (vertical metrics) contains large ranges (e.g. first..last spanning many 
> CIDs).
>  * Likely root causes (two cooperating issues)
> 1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
> clears soft references, causing cached font objects to be discarded. 
> Subsequent uses re-parse the same (heavy) font repeatedly. This GC -> 
> re-parse -> GC cycle can escalate memory usage and trigger OOM.
> 2. W2 range entries (first last w1y v.x v.y) are expanded naively into 
> per-CID HashMap entries (boxed Integer/Float and Vector objects). A single 
> large range (e.g. 0..16000) causes creation of tens of thousands of objects 
> and large HashMap memory overhead, causing immediate heap exhaustion.
>  * Suggested fixes (implementation-level guidance)
>  -- Avoid relying on SoftReference for the per-resource direct font cache for 
> non-indirect embedded fonts. Use strong references scoped to PDResources (or 
> make the behavior configurable). PDResources is freed with the document 
> lifecycle, so strong references prevent repeated re-parsing without leaking 
> across documents.
>  -- Do not expand large W2 ranges into individual boxed map entries. Parse 
> and store W2 ranges compactly (e.g. range list with primitive arrays or small 
> objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
> use a compact index). This avoids creating thousands of Integer/Float/Vector 
> objects for wide ranges.
>  -- Add tests exercising large CID fonts with wide W2 ranges to guard against 
> regressions, and add a memory-use test if possible.
>  * Why fix should be upstream
>  ** This is a parser/runtime efficiency bug that affects robustness for 
> real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and 
> large allocations across all users.
>  
> *<Example>*
> {*}[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
> {code:java}
> // code placeholder
> ### 
> [/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java](https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java)
> ```diff
> @@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
>          // we don't actually read the complete table here because it can 
> contain tens of thousands of glyphs
>          // cache the relevant part of the font data so that the data stream 
> can be closed if it is no longer needed
>          byte[] dataBytes = data.read((int) getLength());
> +        Runtime rt = Runtime.getRuntime();
> +        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  
> free=%6.1fMB  used=%6.1fMB%n",
> +            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
> +            rt.freeMemory() / 1024.0 / 1024.0,
> +            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          try (RandomAccessReadBuffer read = new 
> RandomAccessReadBuffer(dataBytes))
>          {
>              this.data = new RandomAccessReadDataStream(read);
> ``` {code}
> This patch logs the size of the read and the JVM heap state (free/used) 
> immediately after the GlyphTable byte array is loaded, to show how much 
> memory each load consumes and whether the heap is becoming constrained. In 
> this investigation, it helped confirm that “the heap decreases with each load 
> and then recovers after GC,” which was useful for tracing the cause of the 
> OutOfMemoryError (OOM).
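The same free/used measurement can be factored into a small helper instead of being repeated inline at each call site. The following is only a sketch: `HeapLog`, `toMb` and `snapshot` are hypothetical names that do not exist in PDFBox, and the 8 MB array merely stands in for a glyph table read.

```java
import java.util.Locale;

/** Hypothetical helper (not part of PDFBox) for logging JVM heap state. */
final class HeapLog
{
    private static final double MB = 1024.0 * 1024.0;

    private HeapLog()
    {
    }

    /** Formats a byte count as megabytes with one decimal place. */
    static String toMb(long bytes)
    {
        return String.format(Locale.ROOT, "%.1fMB", bytes / MB);
    }

    /** Builds one log line with the currently free and used heap. */
    static String snapshot(String label)
    {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();
        long used = rt.totalMemory() - free;
        return String.format(Locale.ROOT, "[%s] free=%s used=%s",
                label, toMb(free), toMb(used));
    }

    public static void main(String[] args)
    {
        System.out.println(snapshot("before"));
        byte[] data = new byte[8 * 1024 * 1024]; // stand-in for a table read
        System.out.println(snapshot("after read of " + toMb(data.length)));
    }
}
```

Note that `freeMemory()` is relative to the currently committed heap (`totalMemory()`), so the numbers are only a rough indicator when the heap is still growing towards `-Xmx`.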
>  
> {*}[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
>  
>  
> {code:java}
> @@ -17,7 +17,6 @@
>  package org.apache.pdfbox.pdmodel;
>  
>  import java.io.IOException;
> -import java.lang.ref.SoftReference;
>  import java.util.Collections;
>  import java.util.HashMap;
>  import java.util.Map;
> @@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
>  import 
> org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>  import org.apache.pdfbox.pdmodel.font.PDFont;
>  import org.apache.pdfbox.pdmodel.font.PDFontFactory;
> +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
>  import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>  import 
> org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
> -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
> -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
>  import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
> -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
>  
>  /**
>   * A set of resources available at the page/pages/stream level.
> @@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
>      
>      // PDFBOX-3442 cache fonts that are not indirect objects, as these 
> aren't cached in ResourceCache
>      // and this would result in huge memory footprint in text extraction
> -    private final Map<COSName, SoftReference<PDFont>> directFontCache;
> +    // NOTE: changed from SoftReference to strong reference to prevent GC 
> clearing under memory pressure
> +    // causing repeated re-parse of large CID fonts (death spiral leading to 
> OOM)
> +    private final Map<COSName, PDFont> directFontCache;
>  
>      /**
>       * Constructor for embedding.
> @@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
>       * @param directFontCache The document's direct font cache. Must be 
> mutable
>       */
>      public PDResources(COSDictionary resourceDictionary, ResourceCache 
> resourceCache,
> -            Map<COSName, SoftReference<PDFont>> directFontCache)
> +            Map<COSName, PDFont> directFontCache)
>      {
>          if (resourceDictionary == null)
>          {
> @@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            SoftReference<PDFont> ref = directFontCache.get(name);
> -            if (ref != null)
> +            System.out.println("Font " + name + " is not an indirect object, 
> caching in directFontCache");
> +            PDFont cached = directFontCache.get(name);
> +            if (cached != null)
>              {
> -                PDFont cached = ref.get();
> -                if (cached != null)
> -                {
> -                    return cached;
> -                }
> +                return cached;
>              }
>          }
>  
> @@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            directFontCache.put(name, new SoftReference<>(font));
> +            directFontCache.put(name, font);
>          }
>          return font;
>      } {code}
>  * What Was Changed (Key Code Changes)
>  ** The `PDResources` field was changed from:
>  *** Before: `Map<COSName, SoftReference<PDFont>> directFontCache`
>  *** After: `Map<COSName, PDFont> directFontCache`
>  ** The constructor parameter type was also updated accordingly:
>  *** `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
>  ** In `getFont(...)`, the retrieval logic was simplified:
>  *** Before: Retrieved a `SoftReference` and then called `ref.get()` to 
> obtain the `PDFont`.
>  *** After: The `PDFont` is retrieved directly, so `ref.get()` is no longer 
> needed.
>  ** The caching logic was also updated:
>  *** Before: `directFontCache.put(name, new SoftReference<>(font))`
>  *** After: `directFontCache.put(name, font)`
>  * Original Problem (Why the Fix Was Necessary)
>  ** The original implementation stored *fonts embedded directly as 
> dictionaries* using `SoftReference`.
>  ** `SoftReference` allows the JVM to *automatically clear cached objects when 
> memory becomes low*.
>  ** This created a problem:
> 1. When memory pressure occurs, the JVM clears the cached fonts.
> 2. The next time the same font is requested, the system *re-parses the font 
> from the PDF*.
> 3. Font parsing can allocate large temporary structures.
> 4. Under memory pressure, this leads to a loop:
> ```
> GC clears font cache
> → font requested again
> → font parsed again
> → large memory allocation
> → GC clears again
> → repeat
> ```
>  ** This reparse loop can eventually cause an OutOfMemoryError (OOM). The 
> issue becomes particularly severe with *CID fonts (large CJK fonts)*, because 
> parsing them creates very large in-memory structures.
>  * What This Fix Improves
>  ** By switching `directFontCache` to *strong references (`PDFont`)*, the JVM 
> can no longer clear the cached fonts automatically.
>  ** This prevents the cycle:
> ```
> memory pressure
> → cache cleared
> → font re-parsed
> → more memory pressure
> ```
>  ** As a result, the system *stops repeatedly parsing the same fonts*, 
> preventing unnecessary heap consumption and avoiding the OOM scenario.
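The difference between the two caching strategies can be illustrated outside PDFBox. In the sketch below every name (`FontCacheDemo`, `SoftCache`, `StrongCache`, `parseFont`) is hypothetical, a `String` stands in for `PDFont`, and memory pressure is simulated by clearing the references explicitly with `Reference.clear()`, since real GC behaviour is not deterministic.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins for the two caching strategies; not PDFBox code. */
final class FontCacheDemo
{
    /** Counts how often the (expensive) parse step runs. */
    static int parseCount = 0;

    /** Stand-in for parsing a font dictionary into a font object. */
    static String parseFont(String name)
    {
        parseCount++;
        return "parsed:" + name;
    }

    /** Soft cache: entries may be cleared, which forces a re-parse. */
    static final class SoftCache
    {
        private final Map<String, SoftReference<String>> map = new HashMap<>();

        String get(String name)
        {
            SoftReference<String> ref = map.get(name);
            String font = ref != null ? ref.get() : null;
            if (font == null)
            {
                font = parseFont(name);
                map.put(name, new SoftReference<>(font));
            }
            return font;
        }

        /** Simulates what the GC may do to soft references when memory is low. */
        void simulateMemoryPressure()
        {
            map.values().forEach(SoftReference::clear);
        }
    }

    /** Strong cache: an entry stays until the cache itself is dropped. */
    static final class StrongCache
    {
        private final Map<String, String> map = new HashMap<>();

        String get(String name)
        {
            return map.computeIfAbsent(name, FontCacheDemo::parseFont);
        }
    }

    public static void main(String[] args)
    {
        SoftCache soft = new SoftCache();
        soft.get("F1");
        soft.simulateMemoryPressure();
        soft.get("F1");
        System.out.println("soft cache parses: " + parseCount);

        parseCount = 0;
        StrongCache strong = new StrongCache();
        strong.get("F1");
        strong.get("F1");
        System.out.println("strong cache parses: " + parseCount);
    }
}
```

The trade-off is that a strong cache keeps every cached font alive for as long as the cache is reachable, so its lifetime has to be bounded by the owner (here, the resources object or the document).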
>  
> {*}[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
> {code:java}
> ```diff
> @@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
>  
>  import java.io.IOException;
>  import java.io.InputStream;
> +import java.util.Arrays;
>  import java.util.HashMap;
>  import java.util.Map;
> +
>  import org.apache.commons.logging.Log;
>  import org.apache.commons.logging.LogFactory;
> -
>  import org.apache.pdfbox.cos.COSArray;
>  import org.apache.pdfbox.cos.COSBase;
>  import org.apache.pdfbox.cos.COSDictionary;
> @@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, 
> PDFontLike, PDVectorFo
>      private float defaultWidth;
>      private float averageWidth;
>  
> -    private final Map<Integer, Float> verticalDisplacementY = new 
> HashMap<>(); // w1y
> -    private final Map<Integer, Vector> positionVectors = new HashMap<>();    
>  // v
> +    private final Map<Integer, Float> verticalDisplacementY = new 
> HashMap<>(); // w1y (individual entries)
> +    private final Map<Integer, Vector> positionVectors = new HashMap<>();    
>  // v   (individual entries)
> +    // Range-based W2 entries stored as compact primitive arrays to avoid 
> HashMap boxing overhead.
> +    // A single range entry (first..last) replaces thousands of individual 
> HashMap entries.
> +    private int[] vdRangeFirst = new int[0];
> +    private int[] vdRangeLast  = new int[0];
> +    private float[] vdRangeW1y = new float[0];
> +    private float[] vdRangeVx  = new float[0];
> +    private float[] vdRangeVy  = new float[0];
>      private final float[] dw2 = new float[] { 880, -1000 };
>  
>      protected final COSDictionary dict;
> @@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, 
> PDFontLike, PDVectorFo
>      {
>          this.dict = fontDictionary;
>          this.parent = parent;
> +        Runtime rt = Runtime.getRuntime();
> +        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
> +        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  
> used=%6.1fMB%n",
> +            fontName,
> +            rt.freeMemory() / 1024.0 / 1024.0,
> +            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          readWidths();
> +        System.out.printf("[PDCIDFont] after readWidths  %-32s  
> widths.size=%d  free=%6.1fMB%n",
> +            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
>          readVerticalDisplacements();
> +        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  
> ranges=%d  free=%6.1fMB%n",
> +            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
> +            rt.freeMemory() / 1024.0 / 1024.0);
>      }
>  
>      private void readWidths()
> @@ -180,11 +199,21 @@ public abstract class PDCIDFont implements 
> COSObjectable, PDFontLike, PDVectorFo
>                      COSNumber w1y = (COSNumber) w2Array.getObject(++i);
>                      COSNumber v1x = (COSNumber) w2Array.getObject(++i);
>                      COSNumber v1y = (COSNumber) w2Array.getObject(++i);
> -                    for (int cid = first; cid <= last; cid++)
> -                    {
> -                        verticalDisplacementY.put(cid, w1y.floatValue());
> -                        positionVectors.put(cid, new 
> Vector(v1x.floatValue(), v1y.floatValue()));
> -                    }
> +                    // Store as a compact range entry instead of expanding 
> to per-CID HashMap entries.
> +                    // This avoids allocating thousands of boxed 
> Integer/Float/Vector objects
> +                    // when a single range covers many CIDs (e.g. 0..16000).
> +                    int n = vdRangeFirst.length;
> +                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
> +                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
> +                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
> +                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
> +                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
> +                    vdRangeFirst[n] = first;
> +                    vdRangeLast[n]  = last;
> +                    vdRangeW1y[n]   = w1y.floatValue();
> +                    vdRangeVx[n]    = v1x.floatValue();
> +                    vdRangeVy[n]    = v1y.floatValue();
>                  }
>              }
>          }
> @@ -288,12 +317,21 @@ public abstract class PDCIDFont implements 
> COSObjectable, PDFontLike, PDVectorFo
>      public Vector getPositionVector(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Vector v = positionVectors.get(cid);
> -        if (v == null)
> +        if (v != null)
> +        {
> +            return v;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
>          {
> -            v = getDefaultPositionVector(cid);
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return new Vector(vdRangeVx[i], vdRangeVy[i]);
> +            }
>          }
> -        return v;
> +        return getDefaultPositionVector(cid);
>      }
>  
>      /**
> @@ -305,12 +343,21 @@ public abstract class PDCIDFont implements 
> COSObjectable, PDFontLike, PDVectorFo
>      public float getVerticalDisplacementVectorY(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Float w1y = verticalDisplacementY.get(cid);
> -        if (w1y == null)
> +        if (w1y != null)
>          {
> -            w1y = dw2[1];
> +            return w1y;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
> +        {
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return vdRangeW1y[i];
> +            }
>          }
> -        return w1y;
> +        return dw2[1];
>      }
>  
>      @Override``` {code}
> To reduce memory consumption, I stopped expanding **large CID ranges in W2 
> (vertical metrics)** CID by CID into per-entry objects. Instead, each range 
> is now stored once in **small primitive arrays**. This avoids creating large 
> numbers of `Integer`, `Float`, and `Vector` objects and prevents the 
> **OutOfMemoryError (OOM)**.
>  * Changes
>  ** Added `Arrays` to the imports (used for array expansion).
>  ** Added new fields:
>  *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
>  *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
>  **** These arrays store the values for each **range entry (first..last)**.
>  ** Changes in `readVerticalDisplacements()`:
>  *** The existing **array format (individual entries)** is handled as before 
> and stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
>  *** When a **range format** entry (`first last w1y v1x v1y`) is encountered, 
> instead of looping from `first..last` and inserting each value into the 
> `HashMap`, the code now:
>  **** Expands the `vdRange*` arrays using `Arrays.copyOf`
>  **** Stores the **range as a single entry** in those arrays.
>  *** Purpose: prevent generating **tens of thousands of objects** for large 
> ranges (e.g., 16,000 entries).
>  ** Changes in the lookup logic (`getPositionVector` / 
> `getVerticalDisplacementVectorY`):
>  *** First check **individual entries** (`positionVectors` / 
> `verticalDisplacementY`).
>  **** If found, return that value.
>  *** Otherwise, **linearly search** the `vdRange*` arrays to see if the CID 
> falls within a stored range.
>  **** If a matching range is found, return the corresponding value.
>  *** If neither matches, return the **default value**.
>  ** Why this works (briefly)
>  *** Before: Receiving a range like `0..16000` caused the code to insert 
> **16,001 entries** into each of the two `HashMap`s, with boxed `Integer` keys 
> and `Float` or `Vector` values, leading to large memory overhead.
>  *** Now: The same range is stored as **a single range entry** whose payload 
> is two `int`s and three `float`s (**20 bytes per range**, plus array overhead).
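The range storage described above can be sketched as a small standalone class. This is an illustration under assumptions, not the patch itself: `W2RangeTable` and its methods are hypothetical names, and unlike the diff (which grows each array by one element per range via `Arrays.copyOf`) this variant doubles the capacity so that fonts with many ranges do not copy the arrays on every insert. The lookup is the same linear scan.

```java
import java.util.Arrays;

/** Hypothetical compact store for W2 range entries (first last w1y vx vy). */
final class W2RangeTable
{
    private int[] first = new int[4];
    private int[] last = new int[4];
    private float[] w1y = new float[4];
    private float[] vx = new float[4];
    private float[] vy = new float[4];
    private int size = 0;

    /** Appends one range; capacity doubles when the arrays are full. */
    void add(int firstCid, int lastCid, float w1yValue, float vxValue, float vyValue)
    {
        if (size == first.length)
        {
            int capacity = size * 2;
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            w1y = Arrays.copyOf(w1y, capacity);
            vx = Arrays.copyOf(vx, capacity);
            vy = Arrays.copyOf(vy, capacity);
        }
        first[size] = firstCid;
        last[size] = lastCid;
        w1y[size] = w1yValue;
        vx[size] = vxValue;
        vy[size] = vyValue;
        size++;
    }

    /** Returns the index of the first range containing cid, or -1 if none. */
    int indexOf(int cid)
    {
        for (int i = 0; i < size; i++)
        {
            if (cid >= first[i] && cid <= last[i])
            {
                return i;
            }
        }
        return -1;
    }

    /** Vertical displacement for cid, or the given default if no range matches. */
    float w1yOr(int cid, float defaultValue)
    {
        int i = indexOf(cid);
        return i >= 0 ? w1y[i] : defaultValue;
    }

    /** Position vector {vx, vy} for cid, or null if no range matches. */
    float[] positionOrNull(int cid)
    {
        int i = indexOf(cid);
        return i >= 0 ? new float[] { vx[i], vy[i] } : null;
    }

    int size()
    {
        return size;
    }

    public static void main(String[] args)
    {
        W2RangeTable table = new W2RangeTable();
        table.add(0, 16000, -1000f, 500f, 880f); // one entry instead of 16,001
        System.out.println("ranges stored: " + table.size());
        System.out.println("w1y for CID 8000: " + table.w1yOr(8000, -1f));
    }
}
```

If a font carried a very large number of ranges, the linear scan could be replaced by sorting the ranges by `first` and using a binary search; for a handful of ranges the scan is likely sufficient.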
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
