[jira] [Updated] (PDFBOX-6175) OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared + W2 range expansion leads to OOM

Kiyotsuki Suzuki (Jira) Thu, 12 Mar 2026 00:44:10 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kiyotsuki Suzuki updated PDFBOX-6175:
-------------------------------------
    Description: 
When processing PDFs that contain large CID fonts (many CIDs 
and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during 
font parsing / text extraction even with modest heap sizes.

*<Observed symptom>*
 - OOM occurs while creating PDFont / PDCIDFont instances during text 
extraction.
 - Problem appears when fonts are embedded as non-indirect objects and when W2 
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).

 * Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
clears soft references, causing cached font objects to be discarded. Subsequent 
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle 
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID 
HashMap entries (boxed Integer/Float and Vector objects). A single large range 
(e.g. 0..16000) causes creation of tens of thousands of objects and large 
HashMap memory overhead, causing immediate heap exhaustion.
 * Suggested fixes (implementation-level guidance)
 -- Avoid relying on SoftReference for the per-resource direct font cache for 
non-indirect embedded fonts. Use strong references scoped to PDResources (or 
make the behavior configurable). PDResources is freed with the document 
lifecycle, so strong references prevent repeated re-parsing without leaking 
across documents.
 -- Do not expand large W2 ranges into individual boxed map entries. Parse and 
store W2 ranges compactly (e.g. range list with primitive arrays or small 
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
use a compact index). This avoids creating thousands of Integer/Float/Vector 
objects for wide ranges.
 -- Add tests exercising large CID fonts with wide W2 ranges to guard against 
regressions, and add a memory-use test if possible.
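
The "compact index" option in the second suggestion could look roughly like the sketch below. It is not PDFBox code: `W2RangeTable` and its methods are hypothetical names, and it assumes ranges are appended in ascending, non-overlapping CID order so that a binary search is valid (a real W2 array does not guarantee this, so a real implementation would have to sort or fall back to a linear scan).

```java
import java.util.Arrays;

/** Hypothetical compact store for W2 range entries; illustrative only, not part of PDFBox. */
public final class W2RangeTable
{
    private int[] first = new int[4];
    private int[] last = new int[4];
    private float[] w1y = new float[4];
    private float[] vx = new float[4];
    private float[] vy = new float[4];
    private int size;

    /** Appends one range. Assumption: callers add ranges in ascending, non-overlapping order. */
    public void add(int firstCid, int lastCid, float w1, float x, float y)
    {
        if (size == first.length)
        {
            // Double the capacity so that appending stays amortized O(1).
            int capacity = size * 2;
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            w1y = Arrays.copyOf(w1y, capacity);
            vx = Arrays.copyOf(vx, capacity);
            vy = Arrays.copyOf(vy, capacity);
        }
        first[size] = firstCid;
        last[size] = lastCid;
        w1y[size] = w1;
        vx[size] = x;
        vy[size] = y;
        size++;
    }

    /** Returns the index of the range containing cid, or -1 if no range covers it. */
    public int find(int cid)
    {
        int lo = 0;
        int hi = size - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (cid < first[mid])
            {
                hi = mid - 1;
            }
            else if (cid > last[mid])
            {
                lo = mid + 1;
            }
            else
            {
                return mid;
            }
        }
        return -1;
    }

    public float getW1y(int index) { return w1y[index]; }
    public float getVx(int index) { return vx[index]; }
    public float getVy(int index) { return vy[index]; }
}
```

A caller would use `find(cid)` and fall back to the default vertical metrics when it returns -1. Doubling the arrays keeps parsing cheap even when a font has many ranges.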

 * Why fix should be upstream
 ** This is a parser/runtime efficiency bug that affects robustness for 
real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and 
large allocations across all users.

—

*<Example>*

[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java](https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java)
```diff
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
         // we don't actually read the complete table here because it can contain tens of thousands of glyphs
         // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
         byte[] dataBytes = data.read((int) getLength());
+        Runtime rt = Runtime.getRuntime();
+        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  free=%6.1fMB  used=%6.1fMB%n",
+            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
         {
             this.data = new RandomAccessReadDataStream(read);
```

This is a log that prints the read size and the JVM memory state (free/used) 
immediately after loading the byte array of the GlyphTable, in order to 
visualize how much memory was consumed and whether the heap is becoming 
constrained.

In this investigation, it helped confirm that “the heap decreases with each 
load and then recovers after GC,” which was useful for tracing the cause of the 
OutOfMemoryError (OOM).

—

[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java)
```diff
@@ -17,7 +17,6 @@
 package org.apache.pdfbox.pdmodel;
 
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 
 /**
  * A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
 
     // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
     // and this would result in huge memory footprint in text extraction
-    private final Map<COSName, SoftReference<PDFont>> directFontCache;
+    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
+    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
+    private final Map<COSName, PDFont> directFontCache;
 
     /**
      * Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
      * @param directFontCache The document's direct font cache. Must be mutable
      */
     public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
-            Map<COSName, SoftReference<PDFont>> directFontCache)
+            Map<COSName, PDFont> directFontCache)
     {
         if (resourceDictionary == null)
         {
@@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            SoftReference<PDFont> ref = directFontCache.get(name);
-            if (ref != null)
+            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
+            PDFont cached = directFontCache.get(name);
+            if (cached != null)
             {
-                PDFont cached = ref.get();
-                if (cached != null)
-                {
-                    return cached;
-                }
+                return cached;
             }
         }
 
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            directFontCache.put(name, new SoftReference<>(font));
+            directFontCache.put(name, font);
         }
         return font;
     }
```
## What Was Changed (Key Code Changes)

 * The `PDResources` field was changed from:

  * *Before:*
    `Map<COSName, SoftReference<PDFont>> directFontCache`

  * *After:*
    `Map<COSName, PDFont> directFontCache`
 * The constructor parameter type was also updated accordingly:
  `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.

 * In `getFont(...)`, the retrieval logic was simplified:

  * *Before:*
    Retrieved a `SoftReference` and then called `ref.get()` to obtain the 
`PDFont`.

  * *After:*
    The `PDFont` is retrieved directly, so `ref.get()` is no longer needed.
 * The caching logic was also updated:

  * *Before:*
    `directFontCache.put(name, new SoftReference<>(font))`

  * *After:*
    `directFontCache.put(name, font)`
 * Additionally, a debug message was added:

  ```java
  System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
  ```

—

## Original Problem (Why the Fix Was Necessary)

The original implementation stored *fonts embedded directly as 
dictionaries* using `SoftReference`.

`SoftReference` allows the JVM to *automatically clear cached objects 
when memory becomes low*.

This created a problem:

1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, the system *re-parses the 
font from the PDF*.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:

```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```

This *reparse loop* can eventually cause an *OutOfMemoryError (OOM)*.

The issue becomes particularly severe with *CID fonts (large CJK 
fonts)*, because parsing them creates very large in-memory structures.
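
The difference between the two caching strategies can be illustrated with a small stand-alone sketch. It does not use PDFBox: `SoftCacheDemo` is a hypothetical class, a `byte[]` stands in for a parsed font, and `simulateMemoryPressure()` calls `SoftReference.clear()` by hand because the point at which a real GC clears soft references is not deterministic.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo comparing a soft-reference cache with a strong-reference cache. */
public final class SoftCacheDemo
{
    private final Map<String, SoftReference<byte[]>> softCache = new HashMap<>();
    private final Map<String, byte[]> strongCache = new HashMap<>();
    private int softParses;
    private int strongParses;

    /** Stand-in for an expensive font parse. */
    private static byte[] parse()
    {
        return new byte[1024];
    }

    public byte[] getSoft(String name)
    {
        SoftReference<byte[]> ref = softCache.get(name);
        byte[] font = ref != null ? ref.get() : null;
        if (font == null)
        {
            // Cache miss, or the referent was cleared: parse again.
            font = parse();
            softParses++;
            softCache.put(name, new SoftReference<>(font));
        }
        return font;
    }

    public byte[] getStrong(String name)
    {
        byte[] font = strongCache.get(name);
        if (font == null)
        {
            font = parse();
            strongParses++;
            strongCache.put(name, font);
        }
        return font;
    }

    /** Simulates what the GC is allowed to do to soft references under memory pressure. */
    public void simulateMemoryPressure()
    {
        for (SoftReference<byte[]> ref : softCache.values())
        {
            ref.clear();
        }
    }

    public int getSoftParses() { return softParses; }
    public int getStrongParses() { return strongParses; }
}
```

Requesting the same name before and after `simulateMemoryPressure()` parses twice through the soft cache but only once through the strong cache, which is the re-parse behavior described above.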

—

## What This Fix Improves

By switching `directFontCache` to *strong references (`PDFont`)*, 
the JVM can no longer clear the cached fonts automatically.

This prevents the cycle:

```
memory pressure
→ cache cleared
→ font re-parsed
→ more memory pressure
```

As a result, the system *stops repeatedly parsing the same 
fonts*, preventing unnecessary heap consumption and avoiding the OOM 
scenario.

—

[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java)
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
 
 import java.io.IOException;
 import java.io.InputStream;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
-
 import org.apache.pdfbox.cos.COSArray;
 import org.apache.pdfbox.cos.COSBase;
 import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     private float defaultWidth;
     private float averageWidth;
 
-    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
-    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v
+    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
+    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v   (individual entries)
+    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
+    // A single range entry (first..last) replaces thousands of individual HashMap entries.
+    private int[] vdRangeFirst = new int[0];
+    private int[] vdRangeLast  = new int[0];
+    private float[] vdRangeW1y = new float[0];
+    private float[] vdRangeVx  = new float[0];
+    private float[] vdRangeVy  = new float[0];
     private final float[] dw2 = new float[] { 880, -1000 };
 
     protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     {
         this.dict = fontDictionary;
         this.parent = parent;
+        Runtime rt = Runtime.getRuntime();
+        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
+        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  used=%6.1fMB%n",
+            fontName,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         readWidths();
+        System.out.printf("[PDCIDFont] after readWidths  %-32s  widths.size=%d  free=%6.1fMB%n",
+            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
         readVerticalDisplacements();
+        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  ranges=%d  free=%6.1fMB%n",
+            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
+            rt.freeMemory() / 1024.0 / 1024.0);
     }
 
     private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
                     COSNumber w1y = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1x = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1y = (COSNumber) w2Array.getObject(++i);
-                    for (int cid = first; cid <= last; cid++)
-                    {
-                        verticalDisplacementY.put(cid, w1y.floatValue());
-                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
-                    }
+                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
+                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
+                    // when a single range covers many CIDs (e.g. 0..16000).
+                    int n = vdRangeFirst.length;
+                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
+                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
+                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
+                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
+                    vdRangeFirst[n] = first;
+                    vdRangeLast[n]  = last;
+                    vdRangeW1y[n]   = w1y.floatValue();
+                    vdRangeVx[n]    = v1x.floatValue();
+                    vdRangeVy[n]    = v1y.floatValue();
                 }
             }
         }
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public Vector getPositionVector(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Vector v = positionVectors.get(cid);
-        if (v == null)
+        if (v != null)
+        {
+            return v;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
         {
-            v = getDefaultPositionVector(cid);
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return new Vector(vdRangeVx[i], vdRangeVy[i]);
+            }
         }
-        return v;
+        return getDefaultPositionVector(cid);
     }
 
     /**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public float getVerticalDisplacementVectorY(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Float w1y = verticalDisplacementY.get(cid);
-        if (w1y == null)
+        if (w1y != null)
         {
-            w1y = dw2[1];
+            return w1y;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
+        {
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return vdRangeW1y[i];
+            }
         }
-        return w1y;
+        return dw2[1];
     }
 
     @Override
```

To reduce memory consumption, I stopped expanding *large CID ranges in 
W2 (vertical metrics)* one by one into massive numbers of objects. 
Instead, the range information is now stored in *small primitive 
arrays*. This avoids creating large numbers of `Integer`, `Float`, and 
`Vector` objects and prevents *OutOfMemoryError (OOM)*.
 * Changes
 ** Added `Arrays` to the imports (used for array expansion).
 ** Added new fields:
 *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
 *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)

    These arrays store the values for each *range entry (first..last)*.
 * Added memory status logging in the constructor:

  * Logs free/used memory *before and after `readWidths`* and 
*after `readVD`*.
 * Changes in `readVerticalDisplacements()`:

  * The existing *array format (individual entries)* is handled 
as before and stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
  * When a *range format* entry (`first last w1y v1x v1y`) is 
encountered, instead of looping from `first..last` and inserting each value 
into the `HashMap`, the code now:

    * Expands the `vdRange*` arrays using `Arrays.copyOf`
    * Stores the *range as a single entry* in those arrays.
  * Purpose: prevent generating *tens of thousands of objects* 
for large ranges (e.g., 16,000 entries).
 * Changes in the lookup logic (`getPositionVector` / 
`getVerticalDisplacementVectorY`):

  1. First check *individual entries* (`positionVectors` / 
`verticalDisplacementY`).

     * If found, return that value.
  2. Otherwise, *linearly search* the `vdRange*` arrays to see if 
the CID falls within a stored range.

     * If a matching range is found, return the corresponding value.
  3. If neither matches, return the *default value*.
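
The three-step lookup above can be sketched as a self-contained class. `VerticalMetricLookup` and its members are hypothetical names used only for illustration, and only the `w1y` value is modelled; the position vector would follow the same pattern.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the patched lookup order: individual entry, then range, then default. */
public final class VerticalMetricLookup
{
    private final Map<Integer, Float> individual = new HashMap<>();
    private final int[] rangeFirst;
    private final int[] rangeLast;
    private final float[] rangeW1y;
    private final float defaultW1y;

    public VerticalMetricLookup(int[] rangeFirst, int[] rangeLast, float[] rangeW1y, float defaultW1y)
    {
        this.rangeFirst = rangeFirst.clone();
        this.rangeLast = rangeLast.clone();
        this.rangeW1y = rangeW1y.clone();
        this.defaultW1y = defaultW1y;
    }

    /** Registers an array-format (per-CID) entry. */
    public void putIndividual(int cid, float w1y)
    {
        individual.put(cid, w1y);
    }

    public float w1y(int cid)
    {
        // 1. Individual entries take precedence.
        Float single = individual.get(cid);
        if (single != null)
        {
            return single;
        }
        // 2. Linear scan over the stored ranges; the first match wins.
        for (int i = 0; i < rangeFirst.length; i++)
        {
            if (cid >= rangeFirst[i] && cid <= rangeLast[i])
            {
                return rangeW1y[i];
            }
        }
        // 3. Fall back to the default value.
        return defaultW1y;
    }
}
```

One behavioral point may be worth checking in review: the pre-patch loop wrote every CID into the `HashMap` in W2 order, so where entries overlapped, the later entry overwrote the earlier one. With this lookup order an individual entry always wins over a range, and the first matching range wins over later ones, so fonts with overlapping W2 entries could resolve differently.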
## Why this works (briefly)

 * *Before:*
  Receiving a range like `0..16000` caused the code to create 16,001 entries 
in each of the two `HashMap`s, each with a boxed `Integer` key and a `Float` or 
`Vector` value, leading to large memory overhead.

 * *Now:*
  The same range is stored as *a single range entry*, reducing the stored data 
to about 20 bytes per range (two `int` and three `float` values).
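
To make the comparison concrete, the following sketch counts what the two approaches store for one range. `W2ExpansionCount` is a hypothetical helper, a `float[2]` stands in for PDFBox's `Vector`, and the byte figure covers only the array payload, not object headers or `HashMap` node overhead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper comparing per-CID expansion with range storage. */
public final class W2ExpansionCount
{
    /** Expands first..last per CID, as the pre-patch loop does, and returns the map entries created. */
    public static int expandedEntries(int first, int last)
    {
        Map<Integer, Float> w1y = new HashMap<>();
        Map<Integer, float[]> v = new HashMap<>(); // float[2] stands in for Vector
        for (int cid = first; cid <= last; cid++)
        {
            w1y.put(cid, -1000f);
            v.put(cid, new float[] { 500f, 880f });
        }
        return w1y.size() + v.size();
    }

    /** Array payload for range storage: two ints and three floats per range. */
    public static int rangePayloadBytes(int ranges)
    {
        return ranges * (2 * Integer.BYTES + 3 * Float.BYTES);
    }

    public static void main(String[] args)
    {
        System.out.println(expandedEntries(0, 16000)); // prints 32002
        System.out.println(rangePayloadBytes(1));      // prints 20
    }
}
```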

—

[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java)
```diff
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
index 3b970edd24..5b574d8c66 100644
-    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
+    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
 
     /**
      * Constructor.
```
 * *What was changed*

  * The type of `directFontCache` was changed:

    * *Before:* `Map<COSName, SoftReference<PDFont>>`
    * *After:* `Map<COSName, PDFont>`
  * As a result, the `SoftReference` import was removed, and the `get`/`put` 
logic was rewritten to store and retrieve `PDFont` directly instead of going 
through `SoftReference`.
 * *Why this was changed (problem description)*

  * Previously, `SoftReference` was used so that the JVM could automatically 
clear the referenced `PDFont` objects when memory became tight.
  * However, when the reference was cleared, the same font would be parsed 
again the next time it was requested. Since font parsing is expensive, this 
could happen repeatedly, causing excessive memory allocations.
  * In some cases this repeated parsing led to unnecessary memory pressure and 
eventually an `OutOfMemoryError`.
  * By switching to strong references, the cache is no longer cleared 
unpredictably by the GC, preventing this repeated parse cycle.
 * *Behavior after the change (effect)*

  * The same embedded font will no longer be parsed multiple times within the 
same page or the same `PDResources`.
  * This reduces unnecessary memory allocations and helps prevent 
`OutOfMemoryError`.
  * Fonts are expected to be released according to the lifecycle of 
`PDResources` / `PDDocument`, typically when page processing is finished.

  was:
When processing PDFs that contain large CID fonts (many CIDs 
and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during 
font parsing / text extraction even with modest heap sizes.

*<Observed symptom>*
 - OOM occurs while creating PDFont / PDCIDFont instances during text 
extraction.
 - Problem appears when fonts are embedded as non-indirect objects and when W2 
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).

 * Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
clears soft references, causing cached font objects to be discarded. Subsequent 
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle 
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID 
HashMap entries (boxed Integer/Float and Vector objects). A single large range 
(e.g. 0..16000) causes creation of tens of thousands of objects and large 
HashMap memory overhead, causing immediate heap exhaustion.
 * Suggested fixes (implementation-level guidance)
 -- Avoid relying on SoftReference for the per-resource direct font cache for 
non-indirect embedded fonts. Use strong references scoped to PDResources (or 
make the behavior configurable). PDResources is freed with the document 
lifecycle, so strong references prevent repeated re-parsing without leaking 
across documents.
 -- Do not expand large W2 ranges into individual boxed map entries. Parse and 
store W2 ranges compactly (e.g. range list with primitive arrays or small 
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
use a compact index). This avoids creating thousands of Integer/Float/Vector 
objects for wide ranges.
 -- Add tests exercising large CID fonts with wide W2 ranges to guard against 
regressions, and add a memory-use test if possible.

 * Why fix should be upstream
 ** This is a parser/runtime efficiency bug that affects robustness for 
real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and 
large allocations across all users.

—

*<Example>*

[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java](https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java)
```diff
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
         // we don't actually read the complete table here because it can contain tens of thousands of glyphs
         // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
         byte[] dataBytes = data.read((int) getLength());
+        Runtime rt = Runtime.getRuntime();
+        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  free=%6.1fMB  used=%6.1fMB%n",
+            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
         {
             this.data = new RandomAccessReadDataStream(read);
```

This is a log that prints the read size and the JVM memory state (free/used) 
immediately after loading the byte array of the GlyphTable, in order to 
visualize how much memory was consumed and whether the heap is becoming 
constrained.

In this investigation, it helped confirm that “the heap decreases with each 
load and then recovers after GC,” which was useful for tracing the cause of the 
OutOfMemoryError (OOM).

—

[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java)
```diff
@@ -17,7 +17,6 @@
 package org.apache.pdfbox.pdmodel;
 
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 
 /**
  * A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
 
     // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
     // and this would result in huge memory footprint in text extraction
-    private final Map<COSName, SoftReference<PDFont>> directFontCache;
+    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
+    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
+    private final Map<COSName, PDFont> directFontCache;
 
     /**
      * Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
      * @param directFontCache The document's direct font cache. Must be mutable
      */
     public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
-            Map<COSName, SoftReference<PDFont>> directFontCache)
+            Map<COSName, PDFont> directFontCache)
     {
         if (resourceDictionary == null)
         {
@@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            SoftReference<PDFont> ref = directFontCache.get(name);
-            if (ref != null)
+            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
+            PDFont cached = directFontCache.get(name);
+            if (cached != null)
             {
-                PDFont cached = ref.get();
-                if (cached != null)
-                {
-                    return cached;
-                }
+                return cached;
             }
         }
 
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            directFontCache.put(name, new SoftReference<>(font));
+            directFontCache.put(name, font);
         }
         return font;
     }
```

## What Was Changed (Key Code Changes)

 * The `PDResources` field was changed from:

  * *Before:*
    `Map<COSName, SoftReference<PDFont>> directFontCache`

  * *After:*
    `Map<COSName, PDFont> directFontCache`
 * The constructor parameter type was also updated accordingly:
  `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.

 * In `getFont(...)`, the retrieval logic was simplified:

  * *Before:*
    Retrieved a `SoftReference` and then called `ref.get()` to obtain the 
`PDFont`.

  * *After:*
    The `PDFont` is retrieved directly, so `ref.get()` is no longer needed.
 * The caching logic was also updated:

  * *Before:*
    `directFontCache.put(name, new SoftReference<>(font))`

  * *After:*
    `directFontCache.put(name, font)`
 * Additionally, a debug message was added:

  ```java
  System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
  ```

—

## Original Problem (Why the Fix Was Necessary)

The original implementation stored *fonts embedded directly as 
dictionaries* using `SoftReference`.

`SoftReference` allows the JVM to *automatically clear cached objects 
when memory becomes low*.

This created a problem:

1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, the system *re-parses the 
font from the PDF*.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:

```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```

This *reparse loop* can eventually cause an *OutOfMemoryError (OOM)*.

The issue becomes particularly severe with *CID fonts (large CJK 
fonts)*, because parsing them creates very large in-memory structures.

—

## What This Fix Improves

By switching `directFontCache` to *strong references (`PDFont`)*, 
the JVM can no longer clear the cached fonts automatically.

This prevents the cycle:

```
memory pressure
→ cache cleared
→ font re-parsed
→ more memory pressure
```

As a result, the system *stops repeatedly parsing the same 
fonts*, preventing unnecessary heap consumption and avoiding the OOM 
scenario.

—

[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java 
]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
 
 import java.io.IOException;
 import java.io.InputStream;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
-
 import org.apache.pdfbox.cos.COSArray;
 import org.apache.pdfbox.cos.COSBase;
 import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, 
PDFontLike, PDVectorFo
     private float defaultWidth;
     private float averageWidth;
 
 -    private final Map<Integer, Float> verticalDisplacementY = new 
HashMap<>(); // w1y
 -    private final Map<Integer, Vector> positionVectors = new HashMap<>();     
// v
+    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); 
// w1y (individual entries)
+    private final Map<Integer, Vector> positionVectors = new HashMap<>();     
// v   (individual entries)
+    // Range-based W2 entries stored as compact primitive arrays to avoid 
HashMap boxing overhead.
+    // A single range entry (first..last) replaces thousands of individual 
HashMap entries.
+    private int[] vdRangeFirst = new int[0];
+    private int[] vdRangeLast  = new int[0];
+    private float[] vdRangeW1y = new float[0];
+    private float[] vdRangeVx  = new float[0];
+    private float[] vdRangeVy  = new float[0];
     private final float[] dw2 = new float[] \{ 880, -1000 };
 
     protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     {
         this.dict = fontDictionary;
         this.parent = parent;
+        Runtime rt = Runtime.getRuntime();
+        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
+        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  used=%6.1fMB%n",
+            fontName,
+            rt.freeMemory() / 1024.0 / 1024.0,
+            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         readWidths();
+        System.out.printf("[PDCIDFont] after readWidths  %-32s  widths.size=%d  free=%6.1fMB%n",
+            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
         readVerticalDisplacements();
+        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  ranges=%d  free=%6.1fMB%n",
+            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
+            rt.freeMemory() / 1024.0 / 1024.0);
     }
 
     private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
                     COSNumber w1y = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1x = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1y = (COSNumber) w2Array.getObject(++i);
-                    for (int cid = first; cid <= last; cid++)
-                    {
-                        verticalDisplacementY.put(cid, w1y.floatValue());
-                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
-                    }
+                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
+                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
+                    // when a single range covers many CIDs (e.g. 0..16000).
+                    int n = vdRangeFirst.length;
+                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
+                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
+                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
+                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
+                    vdRangeFirst[n] = first;
+                    vdRangeLast[n]  = last;
+                    vdRangeW1y[n]   = w1y.floatValue();
+                    vdRangeVx[n]    = v1x.floatValue();
+                    vdRangeVy[n]    = v1y.floatValue();
                 }
             }
         }
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public Vector getPositionVector(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Vector v = positionVectors.get(cid);
-        if (v == null)
+        if (v != null)
+        {
+            return v;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
         {
-            v = getDefaultPositionVector(cid);
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return new Vector(vdRangeVx[i], vdRangeVy[i]);
+            }
         }
-        return v;
+        return getDefaultPositionVector(cid);
     }
 
     /**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public float getVerticalDisplacementVectorY(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Float w1y = verticalDisplacementY.get(cid);
-        if (w1y == null)
+        if (w1y != null)
         {
-            w1y = dw2[1];
+            return w1y;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
+        {
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return vdRangeW1y[i];
+            }
         }
-        return w1y;
+        return dw2[1];
     }
 
     @Override
```

To reduce memory consumption, I stopped expanding *large CID ranges in W2
(vertical metrics)* one CID at a time into a massive number of objects.
Instead, the range information is now stored in *small primitive arrays*.
This avoids creating large numbers of `Integer`, `Float`, and `Vector` objects
and helps prevent *OutOfMemoryError (OOM)*.
 * Changes
  * Added `Arrays` to the imports (used for array expansion).
  * Added new fields:
    * `vdRangeFirst` / `vdRangeLast` (`int[]`)
    * `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)

    These arrays store the values for each *range entry (first..last)*.
  * Added memory status logging in the constructor:
    * Logs free/used memory before `readWidths`, and free memory after
      `readWidths` and after `readVerticalDisplacements` (labelled `readVD` in
      the log output).
  * Changes in `readVerticalDisplacements()`:
    * The existing *array format (individual entries)* is handled as before and
      stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
    * When a *range format* entry (`first last w1y v1x v1y`) is encountered,
      instead of looping over `first..last` and inserting each value into the
      `HashMap`s, the code now:
      * Grows the `vdRange*` arrays by one element using `Arrays.copyOf`.
      * Stores the *range as a single entry* in those arrays.
    * Purpose: avoid generating *tens of thousands of objects* for large ranges
      (e.g., 16,000 entries).
  * Changes in the lookup logic (`getPositionVector` /
    `getVerticalDisplacementVectorY`):
    1. First check the *individual entries* (`positionVectors` /
       `verticalDisplacementY`). If found, return that value.
    2. Otherwise, *linearly search* the `vdRange*` arrays to see if the CID
       falls within a stored range. If a matching range is found, return the
       corresponding value.
    3. If neither matches, return the *default value*.
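The storage and lookup steps listed above can be sketched in isolation. This is a simplified illustration rather than the patch itself: `W2RangeTableSketch`, `add`, and `lookup` are hypothetical names, only one value per range is kept, and the arrays grow by doubling instead of by one element per range as in the diff.

```java
import java.util.Arrays;

// Hypothetical illustration of a compact range table; names are not from PDFBox.
final class W2RangeTableSketch
{
    private int[] first = new int[0];
    private int[] last = new int[0];
    private float[] value = new float[0];
    private int size;

    // Stores one (first..last) range as a single slot in the parallel arrays.
    void add(int firstCid, int lastCid, float rangeValue)
    {
        if (size == first.length)
        {
            // Assumption of this sketch: double the capacity to keep appends cheap.
            int capacity = Math.max(4, size * 2);
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            value = Arrays.copyOf(value, capacity);
        }
        first[size] = firstCid;
        last[size] = lastCid;
        value[size] = rangeValue;
        size++;
    }

    // Linear scan; the first range containing cid wins, else the default is returned.
    float lookup(int cid, float defaultValue)
    {
        for (int i = 0; i < size; i++)
        {
            if (cid >= first[i] && cid <= last[i])
            {
                return value[i];
            }
        }
        return defaultValue;
    }

    int rangeCount()
    {
        return size;
    }
}
```

One behavioural difference is worth keeping in mind when comparing with the per-CID expansion: with `HashMap.put`, a later overlapping entry overwrites an earlier one, whereas a first-match scan like this one returns the earliest stored range.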
## Why this works (briefly)

 * *Before:*
  Receiving a range like `0..16000` caused the code to create 16,001 entries
  in each of the two `HashMap`s (boxed `Integer` keys with `Float` or `Vector`
  values), leading to large memory overhead.

 * *Now:*
  The same range is stored as *a single range entry*: five 4-byte primitive
  values, or 20 bytes of array payload per range.
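The before/after comparison can be put into rough numbers. The sketch below is a back-of-envelope estimate only: the per-object sizes are assumptions for a 64-bit HotSpot JVM with compressed object pointers, not measured values, and `W2MemoryEstimateSketch` is a hypothetical helper. It ignores the `HashMap` bucket arrays and the small-value `Integer` cache.

```java
// Back-of-envelope estimate; all object sizes below are ASSUMPTIONS, not measurements.
final class W2MemoryEstimateSketch
{
    static final long NODE_BYTES = 32;    // assumed size of one HashMap.Node
    static final long INTEGER_BYTES = 16; // assumed size of one boxed Integer
    static final long FLOAT_BYTES = 16;   // assumed size of one boxed Float
    static final long VECTOR_BYTES = 24;  // assumed size of a two-float Vector object

    // Per-CID expansion: one node and one boxed key in each of the two maps,
    // plus one Float value and one Vector value.
    static long expandedBytes(int first, int last)
    {
        long cids = (long) last - first + 1;
        long perCid = 2 * NODE_BYTES + 2 * INTEGER_BYTES + FLOAT_BYTES + VECTOR_BYTES;
        return cids * perCid;
    }

    // Range storage: two ints and three floats per range, 4 bytes each.
    static long rangeBytes(int ranges)
    {
        return ranges * 5L * 4L;
    }

    public static void main(String[] args)
    {
        System.out.println("expanded 0..16000: " + expandedBytes(0, 16000) + " bytes (estimate)");
        System.out.println("one range entry:   " + rangeBytes(1) + " bytes of payload");
    }
}
```

Under these assumptions the per-CID cost is a fixed multiple of the range width, while the range representation costs the same regardless of how many CIDs the range covers.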

—

[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java)
```diff
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
index 3b970edd24..5b574d8c66 100644
-    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
+    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
 
     /**
      * Constructor.
```
 * *What was changed*
  * The type of `directFontCache` was changed:
    * *Before:* `Map<COSName, SoftReference<PDFont>>`
    * *After:* `Map<COSName, PDFont>`
  * As a result, the `SoftReference` import was removed, and the `get`/`put`
    logic was rewritten to store and retrieve `PDFont` directly instead of
    going through `SoftReference`.
 * *Why this was changed (problem description)*
  * Previously, `SoftReference` was used so that the JVM could automatically
    clear the referenced `PDFont` objects when memory became tight.
  * However, when the reference was cleared, the same font would be parsed
    again the next time it was requested. Since font parsing is expensive,
    this could happen repeatedly, causing excessive memory allocations.
  * In some cases this repeated parsing led to unnecessary memory pressure and
    eventually an `OutOfMemoryError`.
  * By switching to strong references, the cache is no longer cleared
    unpredictably by the GC, preventing this repeated parse cycle.
 * *Behavior after the change (effect)*
  * The same embedded font will no longer be parsed multiple times within the
    same page or the same `PDResources`.
  * This reduces unnecessary memory allocations and helps prevent
    `OutOfMemoryError`.
  * Fonts are expected to be released according to the lifecycle of
    `PDResources` / `PDDocument`, typically when page processing is finished.


> OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared + 
> W2 range expansion leads to OOM
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.5 PDFBox, 3.0.6 PDFBox, 3.0.7 PDFBox, 3.0.4 JBIG2
>         Environment: openjdk 21.0.7 2025-04-15
> OpenJDK Runtime Environment Homebrew (build 21.0.7)
> OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)
> pdfbox version:3.0.4
>            Reporter: Kiyotsuki Suzuki
>            Priority: Major
>              Labels: performance
>             Fix For: 3.0.4 JBIG2
>
>         Attachments: 6.pdf
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
