[jira] [Updated] (PDFBOX-6175) OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared + W2 range expansion leads to OOM

Kiyotsuki Suzuki (Jira) Thu, 12 Mar 2026 01:17:19 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kiyotsuki Suzuki updated PDFBOX-6175:
-------------------------------------
    Issue Type: Improvement  (was: Bug)

> OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared + 
> W2 range expansion leads to OOM
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6175
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.5 PDFBox, 3.0.6 PDFBox, 3.0.7 PDFBox, 3.0.4 JBIG2
>         Environment: openjdk 21.0.7 2025-04-15
> OpenJDK Runtime Environment Homebrew (build 21.0.7)
> OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)
> pdfbox version:3.0.4
> java -Xmx512M
>            Reporter: Kiyotsuki Suzuki
>            Priority: Major
>              Labels: performance
>             Fix For: 3.0.4 JBIG2
>
>         Attachments: 6.pdf
>
>
> {color:#172b4d}When processing PDFs that contain large CID fonts (many CIDs 
> and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during 
> font parsing / text extraction even with modest heap sizes. {color}
> *<Observed symptom>*
>  - OOM occurs while creating PDFont / PDCIDFont instances during text 
> extraction.
>  - Problem appears when fonts are embedded as non-indirect objects and when 
> W2 (vertical metrics) contains large ranges (e.g. first..last spanning many 
> CIDs).
>  * Likely root causes (two cooperating issues)
> 1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM 
> clears soft references, causing cached font objects to be discarded. 
> Subsequent uses re-parse the same (heavy) font repeatedly. This GC -> 
> re-parse -> GC cycle can escalate memory usage and trigger OOM.
> 2. W2 range entries (first last w1y v.x v.y) are expanded naively into 
> per-CID HashMap entries (boxed Integer/Float and Vector objects). A single 
> large range (e.g. 0..16000) causes creation of tens of thousands of objects 
> and large HashMap memory overhead, causing immediate heap exhaustion.
>  * Suggested fixes (implementation-level guidance)
>  -- Avoid relying on SoftReference for the per-resource direct font cache for 
> non-indirect embedded fonts. Use strong references scoped to PDResources (or 
> make the behavior configurable). PDResources is freed with the document 
> lifecycle, so strong references prevent repeated re-parsing without leaking 
> across documents.
>  -- Do not expand large W2 ranges into individual boxed map entries. Parse 
> and store W2 ranges compactly (e.g. range list with primitive arrays or small 
> objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or 
> use a compact index). This avoids creating thousands of Integer/Float/Vector 
> objects for wide ranges.
>  -- Add tests exercising large CID fonts with wide W2 ranges to guard against 
> regressions, and add a memory-use test if possible.
>  * Why fix should be upstream
>  ** This is a parser/runtime efficiency bug that affects robustness for 
> real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and 
> large allocations across all users.
>  
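To make root cause 2 concrete, here is a minimal sketch. It is not PDFBox code: the class `W2ExpansionSketch` and both of its methods are hypothetical names chosen for illustration. It contrasts the number of map entries produced by per-CID expansion with the single five-value record needed when the range is kept as a range.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class W2ExpansionSketch
{
    // Mimics per-CID expansion: one boxed key and one boxed value per CID.
    static Map<Integer, Float> expand(int first, int last, float w1y)
    {
        Map<Integer, Float> perCid = new HashMap<>();
        for (int cid = first; cid <= last; cid++)
        {
            perCid.put(cid, w1y);
        }
        return perCid;
    }

    // Compact alternative: the whole range is one five-element record.
    static float[] compact(int first, int last, float w1y, float vx, float vy)
    {
        return new float[] { first, last, w1y, vx, vy };
    }

    public static void main(String[] args)
    {
        // A single range 0..16000 becomes 16001 map entries when expanded.
        System.out.println("expanded entries: " + expand(0, 16000, -1000f).size());
        System.out.println("compact values:   " + compact(0, 16000, -1000f, 500f, 880f).length);
    }
}
```

The expanded form also needs a second map of the same size for the position vectors, so the real overhead is roughly double what this sketch shows.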
> *<Example>*
> {*}[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
> {code:java}
> ```diff
> @@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
>          // we don't actually read the complete table here because it can contain tens of thousands of glyphs
>          // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
>          byte[] dataBytes = data.read((int) getLength());
> +        Runtime rt = Runtime.getRuntime();
> +        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs  free=%6.1fMB  used=%6.1fMB%n",
> +            dataBytes.length / (1024.0 * 1024.0), numGlyphs,
> +            rt.freeMemory() / 1024.0 / 1024.0,
> +            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
>          {
>              this.data = new RandomAccessReadDataStream(read);
> ``` {code}
> This log line prints the read size and the JVM memory state (free/used) 
> immediately after the GlyphTable byte array is loaded, to show how much 
> memory was consumed and whether the heap is becoming constrained. In this 
> investigation it helped confirm that “the heap decreases with each load and 
> then recovers after GC,” which was useful for tracing the cause of the 
> OutOfMemoryError (OOM). 
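The same measurement can be factored into a small helper rather than repeated at each call site. This is a sketch under assumptions: `HeapLog` and `snapshot` are hypothetical names, not PDFBox API, and only `java.lang.Runtime` methods from the JDK are used.

```java
import java.util.Locale;

// Hypothetical helper; not part of PDFBox.
final class HeapLog
{
    private static final double MB = 1024.0 * 1024.0;

    // Formats the current heap state, e.g. for printing after a large allocation.
    static String snapshot(String label)
    {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();            // free space within the committed heap
        long used = rt.totalMemory() - free;    // committed heap minus free space
        return String.format(Locale.ROOT, "[%s] free=%.1fMB used=%.1fMB max=%.1fMB",
                label, free / MB, used / MB, rt.maxMemory() / MB);
    }

    public static void main(String[] args)
    {
        System.out.println(snapshot("before"));
        byte[] block = new byte[16 * 1024 * 1024]; // simulate reading a large table
        System.out.println(snapshot("after " + block.length + " bytes"));
    }
}
```

Note that `freeMemory()` is relative to the currently committed heap (`totalMemory()`), not to the `-Xmx` limit, so the sketch also prints `maxMemory()` to give the "free" figure some context.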
>  
> {*}[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
>  
>  
> {code:java}
> @@ -17,7 +17,6 @@
>  package org.apache.pdfbox.pdmodel;
>  
>  import java.io.IOException;
> -import java.lang.ref.SoftReference;
>  import java.util.Collections;
>  import java.util.HashMap;
>  import java.util.Map;
> @@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
>  import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>  import org.apache.pdfbox.pdmodel.font.PDFont;
>  import org.apache.pdfbox.pdmodel.font.PDFontFactory;
> +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
>  import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>  import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
> -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
> -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
>  import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
> -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
>  
>  /**
>   * A set of resources available at the page/pages/stream level.
> @@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
>      
>      // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
>      // and this would result in huge memory footprint in text extraction
> -    private final Map<COSName, SoftReference<PDFont>> directFontCache;
> +    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
> +    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
> +    private final Map<COSName, PDFont> directFontCache;
>  
>      /**
>       * Constructor for embedding.
> @@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
>       * @param directFontCache The document's direct font cache. Must be mutable
>       */
>      public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
> -            Map<COSName, SoftReference<PDFont>> directFontCache)
> +            Map<COSName, PDFont> directFontCache)
>      {
>          if (resourceDictionary == null)
>          {
> @@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            SoftReference<PDFont> ref = directFontCache.get(name);
> -            if (ref != null)
> +            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
> +            PDFont cached = directFontCache.get(name);
> +            if (cached != null)
>              {
> -                PDFont cached = ref.get();
> -                if (cached != null)
> -                {
> -                    return cached;
> -                }
> +                return cached;
>              }
>          }
>  
> @@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            directFontCache.put(name, new SoftReference<>(font));
> +            directFontCache.put(name, font);
>          }
>          return font;
>      } {code}
>  * What Was Changed (Key Code Changes)
>  ** The `PDResources` field was changed from:
>  *** Before: `Map<COSName, SoftReference<PDFont>> directFontCache`
>  *** After: `Map<COSName, PDFont> directFontCache`
>  ** The constructor parameter type was also updated accordingly:
>  *** `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
>  ** In `getFont(...)`, the retrieval logic was simplified:
>  *** Before: Retrieved a `SoftReference` and then called `ref.get()` to 
> obtain the `PDFont`.
>  *** After: The `PDFont` is retrieved directly, so `ref.get()` is no longer 
> needed.
>  ** The caching logic was also updated:
>  *** Before: `directFontCache.put(name, new SoftReference<>(font))`
>  *** After: `directFontCache.put(name, font)`
>  * Original Problem (Why the Fix Was Necessary)
>  ** The original implementation stored fonts embedded directly as 
> dictionaries using `SoftReference`.
>  ** `SoftReference` allows the JVM to automatically clear cached objects when 
> memory becomes low.
>  ** This created a problem:
>  *** 1. When memory pressure occurs, the JVM clears the cached fonts.
>  *** 2. The next time the same font is requested, the system re-parses the 
> font from the PDF.
>  *** 3. Font parsing can allocate large temporary structures.
>  *** 4. Under memory pressure, this leads to a loop:
>  *** ```
>                                 GC clears font cache
>                                 → font requested again
>                                 → font parsed again
>                                 → large memory allocation
>                                 → GC clears again
>                                 → repeat
>                                 ```
>  *** This reparse loop can eventually cause an OutOfMemoryError (OOM). The 
> issue becomes particularly severe with CID fonts (large CJK fonts), because 
> parsing them creates very large in-memory structures.
>  ** What This Fix Improves
>  *** By switching `directFontCache` to strong references (`PDFont`), the JVM 
> can no longer clear the cached fonts automatically.
>  *** This prevents the cycle:
>  **** ```
> memory pressure
> → cache cleared
> → font re-parsed
> → more memory pressure
> ```
>  *** As a result, the system stops repeatedly parsing the same fonts, 
> preventing unnecessary heap consumption and avoiding the OOM scenario.
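The re-parse cycle described above can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: `FontCacheSketch` and its methods are hypothetical, "parsing" is simulated by a counter, and `SoftReference.clear()` is called explicitly to stand in for what the GC may do under memory pressure.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the "font" is just a String and parsing is a counter.
final class FontCacheSketch
{
    static int parseCount = 0;

    // Stand-in for an expensive font parse.
    static String parse(String name)
    {
        parseCount++;
        return "font:" + name;
    }

    // Soft cache lookup: re-parses whenever the reference has been cleared.
    static String getSoft(Map<String, SoftReference<String>> cache, String name)
    {
        SoftReference<String> ref = cache.get(name);
        String font = ref == null ? null : ref.get();
        if (font == null)
        {
            font = parse(name);
            cache.put(name, new SoftReference<>(font));
        }
        return font;
    }

    // Strong cache lookup: parses at most once per key for the life of the map.
    static String getStrong(Map<String, String> cache, String name)
    {
        return cache.computeIfAbsent(name, FontCacheSketch::parse);
    }

    // Requests the same font repeatedly, clearing the soft reference after each
    // request to simulate a GC cycle under memory pressure.
    static int softParses(int rounds)
    {
        parseCount = 0;
        Map<String, SoftReference<String>> cache = new HashMap<>();
        for (int i = 0; i < rounds; i++)
        {
            getSoft(cache, "F1");
            cache.get("F1").clear();
        }
        return parseCount;
    }

    // Same request pattern against a strong cache.
    static int strongParses(int rounds)
    {
        parseCount = 0;
        Map<String, String> cache = new HashMap<>();
        for (int i = 0; i < rounds; i++)
        {
            getStrong(cache, "F1");
        }
        return parseCount;
    }

    public static void main(String[] args)
    {
        System.out.println("soft cache parses:   " + softParses(5));   // prints 5
        System.out.println("strong cache parses: " + strongParses(5)); // prints 1
    }
}
```

The trade-off is that a strong cache holds every parsed font until its owning map is released, so this only helps when the map's lifetime is bounded, as it is when it is scoped to a document.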
>  
> {*}[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
> {code:java}
> ```diff
> @@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
>  
>  import java.io.IOException;
>  import java.io.InputStream;
> +import java.util.Arrays;
>  import java.util.HashMap;
>  import java.util.Map;
> +
>  import org.apache.commons.logging.Log;
>  import org.apache.commons.logging.LogFactory;
> -
>  import org.apache.pdfbox.cos.COSArray;
>  import org.apache.pdfbox.cos.COSBase;
>  import org.apache.pdfbox.cos.COSDictionary;
> @@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      private float defaultWidth;
>      private float averageWidth;
>  
> -    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
> -    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v
> +    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
> +    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v   (individual entries)
> +    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
> +    // A single range entry (first..last) replaces thousands of individual HashMap entries.
> +    private int[] vdRangeFirst = new int[0];
> +    private int[] vdRangeLast  = new int[0];
> +    private float[] vdRangeW1y = new float[0];
> +    private float[] vdRangeVx  = new float[0];
> +    private float[] vdRangeVy  = new float[0];
>      private final float[] dw2 = new float[] { 880, -1000 };
>  
>      protected final COSDictionary dict;
> @@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      {
>          this.dict = fontDictionary;
>          this.parent = parent;
> +        Runtime rt = Runtime.getRuntime();
> +        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
> +        System.out.printf("[PDCIDFont] init %-40s  free=%6.1fMB  used=%6.1fMB%n",
> +            fontName,
> +            rt.freeMemory() / 1024.0 / 1024.0,
> +            (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          readWidths();
> +        System.out.printf("[PDCIDFont] after readWidths  %-32s  widths.size=%d  free=%6.1fMB%n",
> +            fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
>          readVerticalDisplacements();
> +        System.out.printf("[PDCIDFont] after readVD      %-32s  indiv=%d  ranges=%d  free=%6.1fMB%n",
> +            fontName, verticalDisplacementY.size(), vdRangeFirst.length,
> +            rt.freeMemory() / 1024.0 / 1024.0);
>      }
>  
>      private void readWidths()
> @@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>                      COSNumber w1y = (COSNumber) w2Array.getObject(++i);
>                      COSNumber v1x = (COSNumber) w2Array.getObject(++i);
>                      COSNumber v1y = (COSNumber) w2Array.getObject(++i);
> -                    for (int cid = first; cid <= last; cid++)
> -                    {
> -                        verticalDisplacementY.put(cid, w1y.floatValue());
> -                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
> -                    }
> +                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
> +                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
> +                    // when a single range covers many CIDs (e.g. 0..16000).
> +                    int n = vdRangeFirst.length;
> +                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
> +                    vdRangeLast  = Arrays.copyOf(vdRangeLast,  n + 1);
> +                    vdRangeW1y   = Arrays.copyOf(vdRangeW1y,   n + 1);
> +                    vdRangeVx    = Arrays.copyOf(vdRangeVx,    n + 1);
> +                    vdRangeVy    = Arrays.copyOf(vdRangeVy,    n + 1);
> +                    vdRangeFirst[n] = first;
> +                    vdRangeLast[n]  = last;
> +                    vdRangeW1y[n]   = w1y.floatValue();
> +                    vdRangeVx[n]    = v1x.floatValue();
> +                    vdRangeVy[n]    = v1y.floatValue();
>                  }
>              }
>          }
> @@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      public Vector getPositionVector(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Vector v = positionVectors.get(cid);
> -        if (v == null)
> +        if (v != null)
> +        {
> +            return v;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
>          {
> -            v = getDefaultPositionVector(cid);
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return new Vector(vdRangeVx[i], vdRangeVy[i]);
> +            }
>          }
> -        return v;
> +        return getDefaultPositionVector(cid);
>      }
>  
>      /**
> @@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      public float getVerticalDisplacementVectorY(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Float w1y = verticalDisplacementY.get(cid);
> -        if (w1y == null)
> +        if (w1y != null)
>          {
> -            w1y = dw2[1];
> +            return w1y;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
> +        {
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return vdRangeW1y[i];
> +            }
>          }
> -        return w1y;
> +        return dw2[1];
>      }
>  
>      @Override``` {code}
> To reduce memory consumption, I stopped expanding large CID ranges in W2 
> (vertical metrics) one by one into massive numbers of objects. Instead, the 
> range information is now stored in small primitive arrays. This avoids 
> creating large numbers of `Integer`, `Float`, and `Vector` objects and 
> prevents OutOfMemoryError (OOM).
>  * Changes
>  ** Added `Arrays` to the imports (used for array expansion).
>  ** Added new fields:
>  *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
>  *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
>  **** These arrays store the values for each range entry (first..last).
>  ** Changes in `readVerticalDisplacements()`:
>  *** The existing array format (individual entries) is handled as before and 
> stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
>  *** When a range format entry (`first last w1y v1x v1y`) is encountered, 
> instead of looping from `first..last` and inserting each value into the 
> `HashMap`, the code now:
>  **** Expands the `vdRange*` arrays using `Arrays.copyOf`
>  **** Stores the range as a single entry in those arrays.
>  *** Purpose: prevent generating tens of thousands of objects for large 
> ranges (e.g., 16,000 entries).
>  ** Changes in the lookup logic (`getPositionVector` / 
> `getVerticalDisplacementVectorY`):
>  *** First check individual entries (`positionVectors` / 
> `verticalDisplacementY`).
>  **** If found, return that value.
>  *** Otherwise, linearly search the `vdRange*` arrays to see if the CID falls 
> within a stored range.
>  **** If a matching range is found, return the corresponding value.
>  *** If neither matches, return the default value.
>  ** Why this works (briefly)
>  *** Before: Receiving a range like `0..16000` caused the code to create 
> 16,001 `Integer`/`Float`/`Vector` objects and store them in a `HashMap`, 
> leading to large memory overhead.
>  *** Now: The same range is stored as a single range entry, reducing memory 
> usage to only a few dozen bytes per range.
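The compact storage idea can also be shown as a standalone class. This is a sketch under assumptions, not the patch itself: `VerticalRangeTable` is a hypothetical name, only the `w1y` value is stored to keep it short, and two details differ from the diff above and are deliberate choices here: the arrays grow geometrically instead of by one element per range, and lookup scans from the newest range backwards.

```java
import java.util.Arrays;

// Hypothetical standalone class; it mirrors the idea of the patch, not its exact code.
final class VerticalRangeTable
{
    private int[] first = new int[0];
    private int[] last = new int[0];
    private float[] w1y = new float[0];
    private int size = 0;

    // Appends one range; capacity doubles when full so adding n ranges costs O(n) copying.
    void add(int firstCid, int lastCid, float displacementY)
    {
        if (size == first.length)
        {
            int capacity = Math.max(4, size * 2);
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            w1y = Arrays.copyOf(w1y, capacity);
        }
        first[size] = firstCid;
        last[size] = lastCid;
        w1y[size] = displacementY;
        size++;
    }

    // Linear scan from the newest range to the oldest; returns defaultValue if no range matches.
    float lookup(int cid, float defaultValue)
    {
        for (int i = size - 1; i >= 0; i--)
        {
            if (cid >= first[i] && cid <= last[i])
            {
                return w1y[i];
            }
        }
        return defaultValue;
    }

    public static void main(String[] args)
    {
        VerticalRangeTable table = new VerticalRangeTable();
        table.add(0, 16000, -1000f);  // one wide range, stored as a single entry
        table.add(100, 200, -500f);   // a later, overlapping range
        System.out.println(table.lookup(50, -880f));
        System.out.println(table.lookup(150, -880f));
        System.out.println(table.lookup(20000, -880f));
    }
}
```

Scanning backwards is an assumption made so that a later overlapping range overrides an earlier one, which is what repeated `HashMap.put` calls did in the expanded form. The diff above scans forwards, so with overlapping ranges the first match wins there.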
>  
> {*}[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java])
>  
> {code:java}
> ```diff
> diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> index 3b970edd24..5b574d8c66 100644
> --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> @@ -18,9 +18,7 @@ package org.apache.pdfbox.pdmodel.interactive.form;
>  
>  import java.awt.geom.GeneralPath;
>  import java.awt.geom.Rectangle2D;
> -
>  import java.io.IOException;
> -import java.lang.ref.SoftReference;
>  import java.util.ArrayList;
>  import java.util.Collections;
>  import java.util.HashMap;
> @@ -75,7 +73,7 @@ public final class PDAcroForm implements COSObjectable
>  
>      private ScriptingHandler scriptingHandler;
>  
> -    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
> +    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
>  
>      /**
>       * Constructor.``` {code}
>  
>  * What was changed
>  ** The type of `directFontCache` was changed:
>  *** Before: `Map<COSName, SoftReference<PDFont>>`
>  *** After: `Map<COSName, PDFont>`
>  ** As a result, the `SoftReference` import was removed. The `get`/`put` 
> logic that uses this map is in `PDResources` (see the diff above), which now 
> stores and retrieves `PDFont` directly instead of going through 
> `SoftReference`.
>  * Why this was changed (problem description)
>  ** Previously, `SoftReference` was used so that the JVM could automatically 
> clear the referenced `PDFont` objects when memory became tight.
>  ** However, when the reference was cleared, the same font would be parsed 
> again the next time it was requested. Since font parsing is expensive, this 
> could happen repeatedly, causing excessive memory allocations.
>  ** In some cases this repeated parsing led to unnecessary memory pressure 
> and eventually an `OutOfMemoryError`.
>  ** By switching to strong references, the cache is no longer cleared 
> unpredictably by the GC, preventing this repeated parse cycle.
>  * Behavior after the change (effect)
>  ** The same embedded font will no longer be parsed multiple times within the 
> same page or the same `PDResources`.
>  ** This reduces unnecessary memory allocations and helps prevent 
> `OutOfMemoryError`.
>  ** Fonts are expected to be released according to the lifecycle of 
> `PDResources` / `PDDocument`, typically when page processing is finished.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

