[
https://issues.apache.org/jira/browse/PDFBOX-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kiyotsuki Suzuki updated PDFBOX-6175:
-------------------------------------
Issue Type: Improvement (was: Bug)
> OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared +
> W2 range expansion leads to OOM
> -------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6175
> URL: https://issues.apache.org/jira/browse/PDFBOX-6175
> Project: PDFBox
> Issue Type: Improvement
> Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
> Affects Versions: 3.0.5 PDFBox, 3.0.6 PDFBox, 3.0.7 PDFBox, 3.0.4 JBIG2
> Environment: openjdk 21.0.7 2025-04-15
> OpenJDK Runtime Environment Homebrew (build 21.0.7)
> OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)
> pdfbox version:3.0.4
> java -Xmx512M
> Reporter: Kiyotsuki Suzuki
> Priority: Major
> Labels: performance
> Fix For: 3.0.4 JBIG2
>
> Attachments: 6.pdf
>
>
> When processing PDFs that contain large CID fonts (many CIDs
> and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during
> font parsing / text extraction even with modest heap sizes.
> *<Observed symptom>*
> - OOM occurs while creating PDFont / PDCIDFont instances during text
> extraction.
> - Problem appears when fonts are embedded as non-indirect objects and when
> W2 (vertical metrics) contains large ranges (e.g. first..last spanning many
> CIDs).
> * Likely root causes (two cooperating issues)
> 1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM
> clears soft references, causing cached font objects to be discarded.
> Subsequent uses re-parse the same (heavy) font repeatedly. This GC ->
> re-parse -> GC cycle can escalate memory usage and trigger OOM.
> 2. W2 range entries (first last w1y v.x v.y) are expanded naively into
> per-CID HashMap entries (boxed Integer/Float and Vector objects). A single
> large range (e.g. 0..16000) causes creation of tens of thousands of objects
> and large HashMap memory overhead, causing immediate heap exhaustion.
> * Suggested fixes (implementation-level guidance)
> -- Avoid relying on SoftReference for the per-resource direct font cache for
> non-indirect embedded fonts. Use strong references scoped to PDResources (or
> make the behavior configurable). PDResources is freed with the document
> lifecycle, so strong references prevent repeated re-parsing without leaking
> across documents.
> -- Do not expand large W2 ranges into individual boxed map entries. Parse
> and store W2 ranges compactly (e.g. range list with primitive arrays or small
> objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or
> use a compact index). This avoids creating thousands of Integer/Float/Vector
> objects for wide ranges.
> -- Add tests exercising large CID fonts with wide W2 ranges to guard against
> regressions, and add a memory-use test if possible.
> * Why fix should be upstream
> ** This is a parser/runtime efficiency bug that affects robustness for
> real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and
> large allocations across all users.
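>
> The object count behind root cause 2 can be estimated with a few lines of
> Java. This is a rough sketch, not PDFBox code: the class and method names are
> made up for illustration, it counts objects rather than bytes, and it assumes
> the default Integer cache (-128..127) and no allocation elimination by the
> JIT.

```java
/** Rough allocation count for one W2 range; an illustrative sketch, not PDFBox code. */
class W2ExpansionEstimate
{
    /** Objects created when first..last is expanded per CID into two HashMaps. */
    static long expandedObjects(int first, int last)
    {
        long cids = (long) last - first + 1;
        // Assumption: Integer.valueOf shares boxed keys only for -128..127 (JVM default).
        long cachedKeys = Math.max(0L, (long) Math.min(last, 127) - Math.max(first, -128) + 1);
        // Per CID: two HashMap nodes, one boxed Float and one Vector ...
        long nodesAndValues = cids * 4;
        // ... plus one boxed Integer key per map for every CID outside the cache.
        long boxedKeys = (cids - cachedKeys) * 2;
        return nodesAndValues + boxedKeys;
    }

    /** Primitive array slots needed when the same range is kept as one record. */
    static int compactSlots()
    {
        return 5; // first, last, w1y, v.x, v.y
    }

    public static void main(String[] args)
    {
        System.out.println("expanded objects: " + expandedObjects(0, 16000));
        System.out.println("compact slots:    " + compactSlots());
    }
}
```

> For the range 0..16000 the estimate comes to roughly 96,000 objects, against
> five primitive array slots when the range is kept as a single record.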
>
> *<Example>*
> {*}[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
> {code:java}
> ```diff
> @@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
> // we don't actually read the complete table here because it can contain tens of thousands of glyphs
> // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
> byte[] dataBytes = data.read((int) getLength());
> + Runtime rt = Runtime.getRuntime();
> + System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs free=%6.1fMB used=%6.1fMB%n",
> + dataBytes.length / (1024.0 * 1024.0), numGlyphs,
> + rt.freeMemory() / 1024.0 / 1024.0,
> + (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
> try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
> {
> this.data = new RandomAccessReadDataStream(read);
> ```
> {code}
> This log prints the read size and the JVM memory state (free/used)
> immediately after the GlyphTable byte array is loaded, to show how much
> memory was consumed and whether the heap is becoming constrained. In this
> investigation, it helped confirm that "the heap decreases with each load and
> then recovers after GC," which was useful for tracing the cause of the
> OutOfMemoryError (OOM).
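>
> A small helper along the following lines would keep such instrumentation out
> of the parsing code. It is only a sketch: HeapLog and snapshot are
> hypothetical names, not part of PDFBox. Note that Runtime.freeMemory() is
> relative to the currently committed heap (totalMemory()), so the sketch also
> prints maxMemory() to show the remaining headroom up to the configured limit.

```java
import java.util.Locale;

/** Hypothetical logging helper; not part of PDFBox. */
class HeapLog
{
    private static final double MB = 1024.0 * 1024.0;

    /** Formats a label together with the JVM's current free, used and max heap in MB. */
    static String snapshot(String label)
    {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();
        long used = rt.totalMemory() - free;
        return String.format(Locale.ROOT, "[%s] free=%6.1fMB used=%6.1fMB max=%6.1fMB",
                label, free / MB, used / MB, rt.maxMemory() / MB);
    }

    public static void main(String[] args)
    {
        System.out.println(snapshot("before"));
        byte[] data = new byte[8 * 1024 * 1024]; // stand-in for a large table read
        System.out.println(snapshot("after " + data.length + " bytes"));
    }
}
```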
>
> {*}[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
>
>
> {code:java}
> @@ -17,7 +17,6 @@
> package org.apache.pdfbox.pdmodel;
>
> import java.io.IOException;
> -import java.lang.ref.SoftReference;
> import java.util.Collections;
> import java.util.HashMap;
> import java.util.Map;
> @@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
> import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
> import org.apache.pdfbox.pdmodel.font.PDFont;
> import org.apache.pdfbox.pdmodel.font.PDFontFactory;
> +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
> import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
> import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
> -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
> -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
> import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
> import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
> -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
>
> /**
> * A set of resources available at the page/pages/stream level.
> @@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
>
> // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
> // and this would result in huge memory footprint in text extraction
> - private final Map<COSName, SoftReference<PDFont>> directFontCache;
> + // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
> + // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
> + private final Map<COSName, PDFont> directFontCache;
>
> /**
> * Constructor for embedding.
> @@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
> * @param directFontCache The document's direct font cache. Must be mutable
> */
> public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
> - Map<COSName, SoftReference<PDFont>> directFontCache)
> + Map<COSName, PDFont> directFontCache)
> {
> if (resourceDictionary == null)
> {
> @@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
> }
> else if (indirect == null)
> {
> - SoftReference<PDFont> ref = directFontCache.get(name);
> - if (ref != null)
> + System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
> + PDFont cached = directFontCache.get(name);
> + if (cached != null)
> {
> - PDFont cached = ref.get();
> - if (cached != null)
> - {
> - return cached;
> - }
> + return cached;
> }
> }
>
> @@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
> }
> else if (indirect == null)
> {
> - directFontCache.put(name, new SoftReference<>(font));
> + directFontCache.put(name, font);
> }
> return font;
> }
> {code}
> * What Was Changed (Key Code Changes)
> ** The `PDResources` field was changed from:
> *** Before: `Map<COSName, SoftReference<PDFont>> directFontCache`
> *** After: `Map<COSName, PDFont> directFontCache`
> ** The constructor parameter type was also updated accordingly:
> *** `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
> ** In `getFont(...)`, the retrieval logic was simplified:
> *** Before: Retrieved a `SoftReference` and then called `ref.get()` to
> obtain the `PDFont`.
> *** After: The `PDFont` is retrieved directly, so `ref.get()` is no longer
> needed.
> ** The caching logic was also updated:
> *** Before: `directFontCache.put(name, new SoftReference<>(font))`
> *** After: `directFontCache.put(name, font)`
> * Original Problem (Why the Fix Was Necessary)
> ** The original implementation stored fonts embedded directly as
> dictionaries using `SoftReference`.
> ** `SoftReference` allows the JVM to automatically clear cached objects when
> memory becomes low.
> ** This created a problem:
> *** 1. When memory pressure occurs, the JVM clears the cached fonts.
> *** 2. The next time the same font is requested, the system re-parses the
> font from the PDF.
> *** 3. Font parsing can allocate large temporary structures.
> *** 4. Under memory pressure, this leads to a loop: GC clears font cache →
> font requested again → font parsed again → large memory allocation → GC
> clears again → repeat.
> *** This reparse loop can eventually cause an OutOfMemoryError (OOM). The
> issue becomes particularly severe with CID fonts (large CJK fonts), because
> parsing them creates very large in-memory structures.
> * What This Fix Improves
> ** By switching `directFontCache` to strong references (`PDFont`), the JVM
> can no longer clear the cached fonts automatically.
> ** This prevents the cycle: memory pressure → cache cleared → font re-parsed
> → more memory pressure.
> ** As a result, the system stops repeatedly parsing the same fonts,
> preventing unnecessary heap consumption and avoiding the OOM scenario.
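>
> The difference between the two cache styles can be reproduced without PDFBox.
> The sketch below is an illustration under stated assumptions:
> DirectCacheSketch and parseFont are hypothetical stand-ins, and the soft
> reference is cleared by hand with clear() to imitate what the GC may do under
> memory pressure.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the direct font cache; not PDFBox code. */
class DirectCacheSketch
{
    static int parses = 0;

    /** Stand-in for an expensive font parse; counts how often it runs. */
    static byte[] parseFont()
    {
        parses++;
        return new byte[1024];
    }

    /** Looks the font up three times, clearing the soft reference after each use. */
    static int parsesWithSoftCache()
    {
        parses = 0;
        Map<String, SoftReference<byte[]>> cache = new HashMap<>();
        for (int i = 0; i < 3; i++)
        {
            SoftReference<byte[]> ref = cache.get("F1");
            byte[] font = ref == null ? null : ref.get();
            if (font == null)
            {
                font = parseFont();
                cache.put("F1", new SoftReference<>(font));
            }
            // Assumption: imitate the GC clearing the reference under memory pressure.
            cache.get("F1").clear();
        }
        return parses;
    }

    /** Looks the font up three times through a strongly referenced cache. */
    static int parsesWithStrongCache()
    {
        parses = 0;
        Map<String, byte[]> cache = new HashMap<>();
        for (int i = 0; i < 3; i++)
        {
            cache.computeIfAbsent("F1", k -> parseFont());
        }
        return parses;
    }

    public static void main(String[] args)
    {
        System.out.println("soft cache parses:   " + parsesWithSoftCache());
        System.out.println("strong cache parses: " + parsesWithStrongCache());
    }
}
```

> With the reference cleared between lookups, every lookup re-parses; the
> strong cache parses once. The trade-off is that the strong cache holds the
> parsed font until its owner is released.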
>
> {*}[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
> {code:java}
> ```diff
> @@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
>
> import java.io.IOException;
> import java.io.InputStream;
> +import java.util.Arrays;
> import java.util.HashMap;
> import java.util.Map;
> +
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> -
> import org.apache.pdfbox.cos.COSArray;
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSDictionary;
> @@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
> private float defaultWidth;
> private float averageWidth;
>
> - private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
> - private final Map<Integer, Vector> positionVectors = new HashMap<>(); // v
> + private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
> + private final Map<Integer, Vector> positionVectors = new HashMap<>(); // v (individual entries)
> + // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
> + // A single range entry (first..last) replaces thousands of individual HashMap entries.
> + private int[] vdRangeFirst = new int[0];
> + private int[] vdRangeLast = new int[0];
> + private float[] vdRangeW1y = new float[0];
> + private float[] vdRangeVx = new float[0];
> + private float[] vdRangeVy = new float[0];
> private final float[] dw2 = new float[] { 880, -1000 };
>
> protected final COSDictionary dict;
> @@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
> {
> this.dict = fontDictionary;
> this.parent = parent;
> + Runtime rt = Runtime.getRuntime();
> + String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
> + System.out.printf("[PDCIDFont] init %-40s free=%6.1fMB used=%6.1fMB%n",
> + fontName,
> + rt.freeMemory() / 1024.0 / 1024.0,
> + (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
> readWidths();
> + System.out.printf("[PDCIDFont] after readWidths %-32s widths.size=%d free=%6.1fMB%n",
> + fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
> readVerticalDisplacements();
> + System.out.printf("[PDCIDFont] after readVD %-32s indiv=%d ranges=%d free=%6.1fMB%n",
> + fontName, verticalDisplacementY.size(), vdRangeFirst.length,
> + rt.freeMemory() / 1024.0 / 1024.0);
> }
>
> private void readWidths()
> @@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
> COSNumber w1y = (COSNumber) w2Array.getObject(++i);
> COSNumber v1x = (COSNumber) w2Array.getObject(++i);
> COSNumber v1y = (COSNumber) w2Array.getObject(++i);
> - for (int cid = first; cid <= last; cid++)
> - {
> - verticalDisplacementY.put(cid, w1y.floatValue());
> - positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
> - }
> + // Store as a compact range entry instead of expanding to per-CID HashMap entries.
> + // This avoids allocating thousands of boxed Integer/Float/Vector objects
> + // when a single range covers many CIDs (e.g. 0..16000).
> + int n = vdRangeFirst.length;
> + vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
> + vdRangeLast = Arrays.copyOf(vdRangeLast, n + 1);
> + vdRangeW1y = Arrays.copyOf(vdRangeW1y, n + 1);
> + vdRangeVx = Arrays.copyOf(vdRangeVx, n + 1);
> + vdRangeVy = Arrays.copyOf(vdRangeVy, n + 1);
> + vdRangeFirst[n] = first;
> + vdRangeLast[n] = last;
> + vdRangeW1y[n] = w1y.floatValue();
> + vdRangeVx[n] = v1x.floatValue();
> + vdRangeVy[n] = v1y.floatValue();
> }
> }
> }
> @@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
> public Vector getPositionVector(int code)
> {
> int cid = codeToCID(code);
> + // Check individual (array-format) entries first
> Vector v = positionVectors.get(cid);
> - if (v == null)
> + if (v != null)
> + {
> + return v;
> + }
> + // Check compact range entries
> + for (int i = 0; i < vdRangeFirst.length; i++)
> {
> - v = getDefaultPositionVector(cid);
> + if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> + {
> + return new Vector(vdRangeVx[i], vdRangeVy[i]);
> + }
> }
> - return v;
> + return getDefaultPositionVector(cid);
> }
>
> /**
> @@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
> public float getVerticalDisplacementVectorY(int code)
> {
> int cid = codeToCID(code);
> + // Check individual (array-format) entries first
> Float w1y = verticalDisplacementY.get(cid);
> - if (w1y == null)
> + if (w1y != null)
> {
> - w1y = dw2[1];
> + return w1y;
> + }
> + // Check compact range entries
> + for (int i = 0; i < vdRangeFirst.length; i++)
> + {
> + if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> + {
> + return vdRangeW1y[i];
> + }
> }
> - return w1y;
> + return dw2[1];
> }
>
> @Override
> ```
> {code}
> To reduce memory consumption, I stopped expanding large CID ranges in W2
> (vertical metrics) one by one into massive numbers of objects. Instead, the
> range information is now stored in small primitive arrays. This avoids
> creating large numbers of `Integer`, `Float`, and `Vector` objects and
> prevents OutOfMemoryError (OOM).
> * Changes
> ** Added `Arrays` to the imports (used for array expansion).
> ** Added new fields:
> *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
> *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
> **** These arrays store the values for each range entry (first..last).
> ** Changes in `readVerticalDisplacements()`:
> *** The existing array format (individual entries) is handled as before and
> stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
> *** When a range format entry (`first last w1y v1x v1y`) is encountered,
> instead of looping from `first..last` and inserting each value into the
> `HashMap`, the code now:
> **** Expands the `vdRange*` arrays using `Arrays.copyOf`
> **** Stores the range as a single entry in those arrays.
> *** Purpose: prevent generating tens of thousands of objects for large
> ranges (e.g., 16,000 entries).
> ** Changes in the lookup logic (`getPositionVector` /
> `getVerticalDisplacementVectorY`):
> *** First check individual entries (`positionVectors` /
> `verticalDisplacementY`).
> **** If found, return that value.
> *** Otherwise, linearly search the `vdRange*` arrays to see if the CID falls
> within a stored range.
> **** If a matching range is found, return the corresponding value.
> *** If neither matches, return the default value.
> ** Why this works (briefly)
> *** Before: Receiving a range like `0..16000` caused the code to insert
> 16,001 entries into each of the two `HashMap`s, with boxed `Integer` keys and
> `Float`/`Vector` values, leading to large memory overhead.
> *** Now: The same range is stored as a single range entry, which takes five
> primitive array slots (about 20 bytes of payload) per range.
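>
> The linear scan in the patch is fine for a handful of ranges. If a font
> carried many ranges, the "compact index" mentioned in the suggested fixes
> could be a binary search over the range starts. The sketch below is one
> possible shape, not what the patch does: VerticalMetricRanges is a
> hypothetical class, it assumes the ranges are sorted by first CID and do not
> overlap, and it covers only w1y.

```java
import java.util.Arrays;

/** Sketch of a sorted, non-overlapping range table; not the patch's data structure. */
class VerticalMetricRanges
{
    private final int[] first;
    private final int[] last;
    private final float[] w1y;

    VerticalMetricRanges(int[] first, int[] last, float[] w1y)
    {
        // Assumption: ranges are sorted by first CID and do not overlap.
        this.first = first.clone();
        this.last = last.clone();
        this.w1y = w1y.clone();
    }

    /** Returns the w1y of the range containing cid, or defaultW1y if none matches. */
    float lookup(int cid, float defaultW1y)
    {
        int pos = Arrays.binarySearch(first, cid);
        // A miss returns -(insertionPoint) - 1; the candidate range is the one just before it.
        int idx = pos >= 0 ? pos : -pos - 2;
        if (idx >= 0 && cid <= last[idx])
        {
            return w1y[idx];
        }
        return defaultW1y;
    }

    public static void main(String[] args)
    {
        VerticalMetricRanges ranges = new VerticalMetricRanges(
                new int[] { 0, 200 }, new int[] { 100, 16000 }, new float[] { -500f, -900f });
        System.out.println(ranges.lookup(50, -1000f));    // inside the first range
        System.out.println(ranges.lookup(150, -1000f));   // in the gap, falls back to the default
        System.out.println(ranges.lookup(16000, -1000f)); // last CID of the second range
    }
}
```

> Lookup cost grows with the logarithm of the number of ranges instead of
> linearly. Whether that matters depends on how many ranges real fonts carry;
> the position vector components could be added as two more float arrays.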
>
> {*}[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java]{*}([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java])
>
> {code:java}
> ```diff
> diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> index 3b970edd24..5b574d8c66 100644
> --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> @@ -18,9 +18,7 @@ package org.apache.pdfbox.pdmodel.interactive.form;
>
> import java.awt.geom.GeneralPath;
> import java.awt.geom.Rectangle2D;
> -
> import java.io.IOException;
> -import java.lang.ref.SoftReference;
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.HashMap;
> @@ -75,7 +73,7 @@ public final class PDAcroForm implements COSObjectable
>
> private ScriptingHandler scriptingHandler;
>
> - private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
> + private final Map<COSName, PDFont> directFontCache = new HashMap<>();
>
> /**
> * Constructor.
> ```
> {code}
>
> * What was changed
> ** The type of `directFontCache` was changed:
> *** Before: `Map<COSName, SoftReference<PDFont>>`
> *** After: `Map<COSName, PDFont>`
> ** As a result, the `SoftReference` import was removed here. The `get`/`put`
> logic that uses this map lives in `PDResources` (shown above), where it now
> stores and retrieves `PDFont` directly instead of going through
> `SoftReference`.
> * Why this was changed (problem description)
> ** Previously, `SoftReference` was used so that the JVM could automatically
> clear the referenced `PDFont` objects when memory became tight.
> ** However, when the reference was cleared, the same font would be parsed
> again the next time it was requested. Since font parsing is expensive, this
> could happen repeatedly, causing excessive memory allocations.
> ** In some cases this repeated parsing led to unnecessary memory pressure
> and eventually an `OutOfMemoryError`.
> ** By switching to strong references, the cache is no longer cleared
> unpredictably by the GC, preventing this repeated parse cycle.
> * Behavior after the change (effect)
> ** The same embedded font will no longer be parsed multiple times within the
> same page or the same `PDResources`.
> ** This reduces unnecessary memory allocations and helps prevent
> `OutOfMemoryError`.
> ** Fonts are expected to be released according to the lifecycle of
> `PDResources` / `PDDocument`, typically when page processing is finished.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)