[
https://issues.apache.org/jira/browse/PDFBOX-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kiyotsuki Suzuki updated PDFBOX-6175:
-------------------------------------
Description:
When processing PDFs that contain large CID fonts (many CIDs
and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during
font parsing / text extraction even with modest heap sizes.
*<Observed symptom>*
- OOM occurs while creating PDFont / PDCIDFont instances during text
extraction.
- Problem appears when fonts are embedded as non-indirect objects and when W2
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).
* Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM
clears soft references, causing cached font objects to be discarded. Subsequent
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID
HashMap entries (boxed Integer/Float and Vector objects). A single large range
(e.g. 0..16000) causes creation of tens of thousands of objects and large
HashMap memory overhead, causing immediate heap exhaustion.
* Suggested fixes (implementation-level guidance)
-- Avoid relying on SoftReference for the per-resource direct font cache for
non-indirect embedded fonts. Use strong references scoped to PDResources (or
make the behavior configurable). PDResources is freed with the document
lifecycle, so strong references prevent repeated re-parsing without leaking
across documents.
-- Do not expand large W2 ranges into individual boxed map entries. Parse and
store W2 ranges compactly (e.g. range list with primitive arrays or small
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or
use a compact index). This avoids creating thousands of Integer/Float/Vector
objects for wide ranges.
-- Add tests exercising large CID fonts with wide W2 ranges to guard against
regressions, and add a memory-use test if possible.
* Why fix should be upstream
** This is a parser/runtime efficiency bug that affects robustness for
real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and
large allocations across all users.
—
*<Example>*
[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java]([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
```diff
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
         // we don't actually read the complete table here because it can contain tens of thousands of glyphs
         // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
         byte[] dataBytes = data.read((int) getLength());
+        Runtime rt = Runtime.getRuntime();
+        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs free=%6.1fMB used=%6.1fMB%n",
+                dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+                rt.freeMemory() / 1024.0 / 1024.0,
+                (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
         {
             this.data = new RandomAccessReadDataStream(read);
```
This log statement prints the read size and the JVM memory state (free/used)
immediately after the byte array of the GlyphTable is loaded, to show how much
memory was consumed and whether the heap is becoming constrained.
In this investigation, it helped confirm that “the heap decreases with each
load and then recovers after GC,” which was useful for tracing the cause of the
OutOfMemoryError (OOM).
—
[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
```diff
@@ -17,7 +17,6 @@
 package org.apache.pdfbox.pdmodel;
 import java.io.IOException;
-import java.lang.ref.SoftReference;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 /**
  * A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
     // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
     // and this would result in huge memory footprint in text extraction
-    private final Map<COSName, SoftReference<PDFont>> directFontCache;
+    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
+    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
+    private final Map<COSName, PDFont> directFontCache;
     /**
      * Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
      * @param directFontCache The document's direct font cache. Must be mutable
      */
     public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
-            Map<COSName, SoftReference<PDFont>> directFontCache)
+            Map<COSName, PDFont> directFontCache)
     {
         if (resourceDictionary == null)
         {
@@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            SoftReference<PDFont> ref = directFontCache.get(name);
-            if (ref != null)
+            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
+            PDFont cached = directFontCache.get(name);
+            if (cached != null)
             {
-                PDFont cached = ref.get();
-                if (cached != null)
-                {
-                    return cached;
-                }
+                return cached;
             }
         }
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
         }
         else if (indirect == null)
         {
-            directFontCache.put(name, new SoftReference<>(font));
+            directFontCache.put(name, font);
         }
         return font;
     }
```
## What Was Changed (Key Code Changes)
* The `PDResources` field was changed:
** *Before:* `Map<COSName, SoftReference<PDFont>> directFontCache`
** *After:* `Map<COSName, PDFont> directFontCache`
* The constructor parameter type was updated accordingly:
`Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
* In `getFont(...)`, the retrieval logic was simplified:
** *Before:* Retrieved a `SoftReference` and then called `ref.get()` to obtain the `PDFont`.
** *After:* The `PDFont` is retrieved directly, so `ref.get()` is no longer needed.
* The caching logic was also updated:
** *Before:* `directFontCache.put(name, new SoftReference<>(font))`
** *After:* `directFontCache.put(name, font)`
* Additionally, a debug message was added:
```java
System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
```
—
## Original Problem (Why the Fix Was Necessary)
The original implementation stored *fonts embedded directly as
dictionaries* using `SoftReference`.
`SoftReference` allows the JVM to *automatically clear cached objects
when memory becomes low*.
This created a problem:
1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, the system *re-parses the
font from the PDF*.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:
```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```
This *reparse loop* can eventually cause an *OutOfMemoryError (OOM)*.
The issue becomes particularly severe with *CID fonts (large CJK
fonts)*, because parsing them creates very large in-memory structures.
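The loop above can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: `parseFont`, `SoftCacheDemo` and `run` are hypothetical names, and memory pressure is simulated by calling `Reference.clear()` by hand rather than waiting for the GC. It counts how often the expensive parse runs when the cached soft references are cleared between requests, compared with a cache whose entries survive.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheDemo
{
    // Hypothetical stand-in for an expensive font parse; counts invocations.
    static int parseCount = 0;

    static final Map<String, SoftReference<byte[]>> cache = new HashMap<>();

    static byte[] parseFont(String name)
    {
        parseCount++;
        return new byte[1024]; // placeholder for the parsed font data
    }

    // Same shape as a SoftReference-based lookup: a cleared reference forces a re-parse.
    static byte[] getFont(String name)
    {
        SoftReference<byte[]> ref = cache.get(name);
        byte[] cached = ref != null ? ref.get() : null;
        if (cached == null)
        {
            cached = parseFont(name);
            cache.put(name, new SoftReference<>(cached));
        }
        return cached;
    }

    // Assumption: models what the GC is allowed to do to soft references under pressure.
    static void simulateMemoryPressure()
    {
        for (SoftReference<byte[]> ref : cache.values())
        {
            ref.clear();
        }
    }

    // Requests the same font repeatedly and returns how many parses were needed.
    static int run(int requests, boolean underPressure)
    {
        parseCount = 0;
        cache.clear();
        for (int i = 0; i < requests; i++)
        {
            getFont("F1");
            if (underPressure)
            {
                simulateMemoryPressure();
            }
        }
        return parseCount;
    }

    public static void main(String[] args)
    {
        System.out.println("parses, references cleared each time: " + run(5, true));
        System.out.println("parses, references kept:              " + run(5, false));
    }
}
```

With the references cleared after every request, each request triggers a fresh parse; when they survive, only the first request parses.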
—
## What This Fix Improves
By switching `directFontCache` to *strong references (`PDFont`)*,
the JVM can no longer clear the cached fonts automatically.
This prevents the cycle:
```
memory pressure
→ cache cleared
→ font re-parsed
→ more memory pressure
```
As a result, the system *stops repeatedly parsing the same
fonts*, preventing unnecessary heap consumption and avoiding the OOM
scenario.
—
[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
 import java.io.IOException;
 import java.io.InputStream;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
-
 import org.apache.pdfbox.cos.COSArray;
 import org.apache.pdfbox.cos.COSBase;
 import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     private float defaultWidth;
     private float averageWidth;
-    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
-    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v
+    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
+    private final Map<Integer, Vector> positionVectors = new HashMap<>();     // v (individual entries)
+    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
+    // A single range entry (first..last) replaces thousands of individual HashMap entries.
+    private int[] vdRangeFirst = new int[0];
+    private int[] vdRangeLast = new int[0];
+    private float[] vdRangeW1y = new float[0];
+    private float[] vdRangeVx = new float[0];
+    private float[] vdRangeVy = new float[0];
     private final float[] dw2 = new float[] { 880, -1000 };
     protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     {
         this.dict = fontDictionary;
         this.parent = parent;
+        Runtime rt = Runtime.getRuntime();
+        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
+        System.out.printf("[PDCIDFont] init %-40s free=%6.1fMB used=%6.1fMB%n",
+                fontName,
+                rt.freeMemory() / 1024.0 / 1024.0,
+                (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
         readWidths();
+        System.out.printf("[PDCIDFont] after readWidths %-32s widths.size=%d free=%6.1fMB%n",
+                fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
         readVerticalDisplacements();
+        System.out.printf("[PDCIDFont] after readVD %-32s indiv=%d ranges=%d free=%6.1fMB%n",
+                fontName, verticalDisplacementY.size(), vdRangeFirst.length,
+                rt.freeMemory() / 1024.0 / 1024.0);
     }
     private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
                     COSNumber w1y = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1x = (COSNumber) w2Array.getObject(++i);
                     COSNumber v1y = (COSNumber) w2Array.getObject(++i);
-                    for (int cid = first; cid <= last; cid++)
-                    {
-                        verticalDisplacementY.put(cid, w1y.floatValue());
-                        positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
-                    }
+                    // Store as a compact range entry instead of expanding to per-CID HashMap entries.
+                    // This avoids allocating thousands of boxed Integer/Float/Vector objects
+                    // when a single range covers many CIDs (e.g. 0..16000).
+                    int n = vdRangeFirst.length;
+                    vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+                    vdRangeLast = Arrays.copyOf(vdRangeLast, n + 1);
+                    vdRangeW1y = Arrays.copyOf(vdRangeW1y, n + 1);
+                    vdRangeVx = Arrays.copyOf(vdRangeVx, n + 1);
+                    vdRangeVy = Arrays.copyOf(vdRangeVy, n + 1);
+                    vdRangeFirst[n] = first;
+                    vdRangeLast[n] = last;
+                    vdRangeW1y[n] = w1y.floatValue();
+                    vdRangeVx[n] = v1x.floatValue();
+                    vdRangeVy[n] = v1y.floatValue();
                 }
             }
         }
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public Vector getPositionVector(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Vector v = positionVectors.get(cid);
-        if (v == null)
+        if (v != null)
+        {
+            return v;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
         {
-            v = getDefaultPositionVector(cid);
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return new Vector(vdRangeVx[i], vdRangeVy[i]);
+            }
         }
-        return v;
+        return getDefaultPositionVector(cid);
     }
     /**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
     public float getVerticalDisplacementVectorY(int code)
     {
         int cid = codeToCID(code);
+        // Check individual (array-format) entries first
         Float w1y = verticalDisplacementY.get(cid);
-        if (w1y == null)
+        if (w1y != null)
         {
-            w1y = dw2[1];
+            return w1y;
+        }
+        // Check compact range entries
+        for (int i = 0; i < vdRangeFirst.length; i++)
+        {
+            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
+            {
+                return vdRangeW1y[i];
+            }
         }
-        return w1y;
+        return dw2[1];
     }
     @Override
```
To reduce memory consumption, I stopped expanding *large CID ranges in
W2 (vertical metrics)* one by one into massive numbers of objects.
Instead, the range information is now stored in *small primitive
arrays*. This avoids creating large numbers of `Integer`, `Float`, and
`Vector` objects and helps prevent *OutOfMemoryError (OOM)*.
* Changes
** Added `Arrays` to the imports (used for array expansion).
** Added new fields:
*** `vdRangeFirst` / `vdRangeLast` (`int[]`)
*** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
These arrays store the values for each *range entry (first..last)*.
** Added memory status logging in the constructor:
*** Logs free/used memory before `readWidths`, then free memory after
`readWidths` and after `readVerticalDisplacements` (`readVD` in the log output).
** Changes in `readVerticalDisplacements()`:
*** The existing *array format (individual entries)* is handled
as before and stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
*** When a *range format* entry (`first last w1y v1x v1y`) is
encountered, instead of looping from `first..last` and inserting each value
into the `HashMap`, the code now:
**** Expands the `vdRange*` arrays using `Arrays.copyOf`
**** Stores the *range as a single entry* in those arrays.
*** Purpose: prevent generating *tens of thousands of objects*
for large ranges (e.g., 16,001 CIDs for `0..16000`).
** Changes in the lookup logic (`getPositionVector` /
`getVerticalDisplacementVectorY`):
1. First check *individual entries* (`positionVectors` /
`verticalDisplacementY`). If found, return that value.
2. Otherwise, *linearly search* the `vdRange*` arrays to see if
the CID falls within a stored range. If a matching range is found, return the
corresponding value.
3. If neither matches, return the *default value*.
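One detail of the lookup order may be worth checking. In the pre-patch loop, a later `put` for the same CID overwrote an earlier one, so the last range covering a CID won; the forward scan in the patch returns the first matching range instead. Whether overlapping W2 ranges occur in real files is not established here. The following is a minimal sketch under that assumption, using a hypothetical `W2RangeTable` class (not part of PDFBox) that scans from the newest range backwards and grows its arrays by doubling rather than by one element per range. Only `w1y` is stored to keep the example short.

```java
import java.util.Arrays;

// Hypothetical helper, not PDFBox API: compact storage for W2 range entries.
public class W2RangeTable
{
    private int[] first = new int[4];
    private int[] last = new int[4];
    private float[] w1y = new float[4];
    private int size = 0;

    // Appends one range; capacity doubles when full, so adding n ranges copies O(n) elements overall.
    public void add(int firstCid, int lastCid, float w1yValue)
    {
        if (size == first.length)
        {
            int capacity = size * 2;
            first = Arrays.copyOf(first, capacity);
            last = Arrays.copyOf(last, capacity);
            w1y = Arrays.copyOf(w1y, capacity);
        }
        first[size] = firstCid;
        last[size] = lastCid;
        w1y[size] = w1yValue;
        size++;
    }

    // Scans from the most recently added range backwards, so a later range
    // overrides an earlier overlapping one, as repeated HashMap.put calls would.
    public float lookup(int cid, float defaultValue)
    {
        for (int i = size - 1; i >= 0; i--)
        {
            if (cid >= first[i] && cid <= last[i])
            {
                return w1y[i];
            }
        }
        return defaultValue;
    }

    public int size()
    {
        return size;
    }

    public static void main(String[] args)
    {
        W2RangeTable table = new W2RangeTable();
        table.add(0, 16000, -900f);   // wide range
        table.add(100, 200, -500f);   // later, overlapping range
        System.out.println(table.lookup(150, -1000f));   // covered by both ranges
        System.out.println(table.lookup(50, -1000f));    // covered by the wide range only
        System.out.println(table.lookup(20000, -1000f)); // not covered, falls back to the default
    }
}
```

The same ordering question applies between individual entries and ranges: the patch always prefers individual entries, whereas the old code let whichever came later in the W2 array win.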
## Why this works (briefly)
* *Before:* A range like `0..16000` caused the code to create 16,001 entries
in each of the two `HashMap`s, every entry with a boxed `Integer` key and a
`Float` or `Vector` value, leading to large memory overhead.
* *Now:* The same range is stored as *a single range entry*, reducing
memory usage to *only a few dozen bytes per range*.
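For a rough sense of scale, the comparison can be put into numbers. The sketch below is a back-of-the-envelope estimate, not a measurement: the object sizes are assumptions typical of a 64-bit HotSpot JVM with compressed references (16 bytes per `Integer` or `Float`, 32 bytes per `HashMap` node, 24 bytes for an object holding two `float` fields, 4 bytes per table slot). It ignores the small `Integer` cache and array headers. `W2MemoryEstimate` is a hypothetical name.

```java
// Hypothetical estimator; all sizes below are assumptions, not measured values.
public class W2MemoryEstimate
{
    static final long INTEGER_BYTES = 16; // boxed Integer key
    static final long FLOAT_BYTES = 16;   // boxed Float value
    static final long NODE_BYTES = 32;    // HashMap node (hash, key, value, next)
    static final long VECTOR_BYTES = 24;  // object with two float fields
    static final long REF_BYTES = 4;      // one slot in the HashMap table

    // HashMap table length: a power of two, doubled while 0.75 * length < entries.
    static long tableSlots(long entries)
    {
        long slots = 16;
        while (slots * 3 / 4 < entries)
        {
            slots *= 2;
        }
        return slots;
    }

    // Estimated bytes for expanding one range into both per-CID maps.
    static long expandedBytes(long cids)
    {
        long displacementMap = cids * (NODE_BYTES + INTEGER_BYTES + FLOAT_BYTES);
        long positionMap = cids * (NODE_BYTES + INTEGER_BYTES + VECTOR_BYTES);
        long tables = 2 * tableSlots(cids) * REF_BYTES;
        return displacementMap + positionMap + tables;
    }

    // Payload of one compact range entry: two ints and three floats.
    static long rangePayloadBytes()
    {
        return 2 * Integer.BYTES + 3 * Float.BYTES;
    }

    public static void main(String[] args)
    {
        long cids = 16001; // the range 0..16000
        System.out.println("expanded maps (bytes): " + expandedBytes(cids));
        System.out.println("range payload (bytes): " + rangePayloadBytes());
    }
}
```

Under these assumptions a single `0..16000` range costs on the order of a few megabytes when expanded, against 20 bytes of payload in the compact form; the cost multiplies with each font that is parsed (or re-parsed).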
—
[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java])
```diff
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
index 3b970edd24..5b574d8c66 100644
-    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
+    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
/**
* Constructor.
```
* *What was changed*
** The type of `directFontCache` was changed:
*** *Before:* `Map<COSName, SoftReference<PDFont>>`
*** *After:* `Map<COSName, PDFont>`
** As a result, the `SoftReference` import was removed, and the `get`/`put`
logic was rewritten to store and retrieve `PDFont` directly instead of going
through `SoftReference`.
* *Why this was changed (problem description)*
** Previously, `SoftReference` was used so that the JVM could automatically
clear the referenced `PDFont` objects when memory became tight.
** However, when the reference was cleared, the same font would be parsed
again the next time it was requested. Since font parsing is expensive, this
could happen repeatedly, causing excessive memory allocations.
** In some cases this repeated parsing led to unnecessary memory pressure and
eventually an `OutOfMemoryError`.
** By switching to strong references, the cache is no longer cleared
unpredictably by the GC, preventing this repeated parse cycle.
* *Behavior after the change (effect)*
** The same embedded font will no longer be parsed multiple times within the
same page or the same `PDResources`.
** This reduces unnecessary memory allocations and helps prevent
`OutOfMemoryError`.
** Fonts are expected to be released according to the lifecycle of
`PDResources` / `PDDocument`, typically when page processing is finished.
was:
{color:#172b4d}When processing PDFs that contain large CID fonts (many CIDs
and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during
font parsing / text extraction even with modest heap sizes. {color}
*<Observed symptom>*
- OOM occurs while creating PDFont / PDCIDFont instances during text
extraction.
- Problem appears when fonts are embedded as non-indirect objects and when W2
(vertical metrics) contains large ranges (e.g. first..last spanning many CIDs).
* Likely root causes (two cooperating issues)
1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM
clears soft references, causing cached font objects to be discarded. Subsequent
uses re-parse the same (heavy) font repeatedly. This GC -> re-parse -> GC cycle
can escalate memory usage and trigger OOM.
2. W2 range entries (first last w1y v.x v.y) are expanded naively into per-CID
HashMap entries (boxed Integer/Float and Vector objects). A single large range
(e.g. 0..16000) causes creation of tens of thousands of objects and large
HashMap memory overhead, causing immediate heap exhaustion.
* Suggested fixes (implementation-level guidance)
-- Avoid relying on SoftReference for the per-resource direct font cache for
non-indirect embedded fonts. Use strong references scoped to PDResources (or
make the behavior configurable). PDResources is freed with the document
lifecycle, so strong references prevent repeated re-parsing without leaking
across documents.
-- Do not expand large W2 ranges into individual boxed map entries. Parse and
store W2 ranges compactly (e.g. range list with primitive arrays or small
objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or
use a compact index). This avoids creating thousands of Integer/Float/Vector
objects for wide ranges.
-- Add tests exercising large CID fonts with wide W2 ranges to guard against
regressions, and add a memory-use test if possible.
* Why fix should be upstream
*
-- This is a parser/runtime efficiency bug that affects robustness for
real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and
large allocations across all users.
—
*<Example>*
[/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java]([https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java])
```diff
@@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
// we don't actually read the complete table here because it can
contain tens of thousands of glyphs
// cache the relevant part of the font data so that the data stream
can be closed if it is no longer needed
byte[] dataBytes = data.read((int) getLength());
+ Runtime rt = Runtime.getRuntime();
+ System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs
free=%6.1fMB used=%6.1fMB%n",
+ dataBytes.length / (1024.0 * 1024.0), numGlyphs,
+ rt.freeMemory() / 1024.0 / 1024.0,
+ (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
try (RandomAccessReadBuffer read = new
RandomAccessReadBuffer(dataBytes))
{
this.data = new RandomAccessReadDataStream(read);
```
This is a log that prints the read size and the JVM memory state (free/used)
immediately after loading the byte array of the GlyphTable, in order to
visualize how much memory was consumed and whether the heap is becoming
constrained.
In this investigation, it helped confirm that “the heap decreases with each
load and then recovers after GC,” which was useful for tracing the cause of the
OutOfMemoryError (OOM).
—
[/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java])
```diff
@@ -17,7 +17,6 @@
package org.apache.pdfbox.pdmodel;
import java.io.IOException;
-import java.lang.ref.SoftReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
@@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
import
org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import
org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
/**
* A set of resources available at the page/pages/stream level.
@@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
// PDFBOX-3442 cache fonts that are not indirect objects, as these aren't
cached in ResourceCache
// and this would result in huge memory footprint in text extraction
- private final Map<COSName, SoftReference<PDFont>> directFontCache;
+ // NOTE: changed from SoftReference to strong reference to prevent GC
clearing under memory pressure
+ // causing repeated re-parse of large CID fonts (death spiral leading to
OOM)
+ private final Map<COSName, PDFont> directFontCache;
/**
* Constructor for embedding.
@@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
* @param directFontCache The document's direct font cache. Must be mutable
*/
public PDResources(COSDictionary resourceDictionary, ResourceCache
resourceCache,
- Map<COSName, SoftReference<PDFont>> directFontCache)
+ Map<COSName, PDFont> directFontCache)
\{ if (resourceDictionary == null) \{ @@ -152,14
+153,11 @@ public final class PDResources implements COSObjectable }
else if (indirect == null)
{
- SoftReference<PDFont> ref = directFontCache.get(name);
- if (ref != null)
+ System.out.println("Font " + name + " is not an indirect object,
caching in directFontCache");
+ PDFont cached = directFontCache.get(name);
+ if (cached != null)
\{ - PDFont cached = ref.get(); -
if (cached != null) - \{ - return cached; -
}
+ return cached;
}
}
@@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
}
else if (indirect == null)
{ - directFontCache.put(name, new SoftReference<>(font)); +
directFontCache.put(name, font); }
return font;
}
```
#
## What Was Changed (Key Code Changes)
* The `PDResources` field was changed from:
* {*}{{*}}Before:{{*}}{*}
`Map<COSName, SoftReference<PDFont>> directFontCache`
* {*}{{*}}After:{{*}}{*}
`Map<COSName, PDFont> directFontCache`
* The constructor parameter type was also updated accordingly:
`Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
* In `getFont(...)`, the retrieval logic was simplified:
* {*}{{*}}Before:{{*}}{*}
Retrieved a `SoftReference` and then called `ref.get()` to obtain the
`PDFont`.
* {*}{{*}}After:{{*}}{*}
The `PDFont` is retrieved directly, so `ref.get()` is no longer needed.
* The caching logic was also updated:
* {*}{{*}}Before:{{*}}{*}
`directFontCache.put(name, new SoftReference<>(font))`
* {*}{{*}}After:{{*}}{*}
`directFontCache.put(name, font)`
* Additionally, a debug message was added:
```java
System.out.println("Font " + name + " is not an indirect object, caching in
directFontCache");
```
—
Original Problem (Why the Fix Was Necessary)
The original implementation stored {*}{{*}}fonts embedded directly as
dictionaries{{*}}{*} using `SoftReference`.
`SoftReference` allows the JVM to {*}{{*}}automatically clear cached objects
when memory becomes low{{*}}{*}.
This created a problem:
1. When memory pressure occurs, the JVM clears the cached fonts.
2. The next time the same font is requested, the system {*}{{*}}re-parses the
font from the PDF{{*}}{*}.
3. Font parsing can allocate large temporary structures.
4. Under memory pressure, this leads to a loop:
```
GC clears font cache
→ font requested again
→ font parsed again
→ large memory allocation
→ GC clears again
→ repeat
```
This {*}{{*}}reparse loop{{*}}{*} can eventually cause an
{*}{{*}}OutOfMemoryError (OOM){{*}}{*}.
The issue becomes particularly severe with {*}{{*}}CID fonts (large CJK
fonts){{*}}{*}, because parsing them creates very large in-memory structures.
—
What This Fix Improves
By switching `directFontCache` to {*}{{*}}strong references (`PDFont`){{*}}{*},
the JVM can no longer clear the cached fonts automatically.
This prevents the cycle:
```
memory pressure
→ cache cleared
→ font re-parsed
→ more memory pressure
```
As a result, the system {*}{{*}}stops repeatedly parsing the same
fonts{{*}}{*}, preventing unnecessary heap consumption and avoiding the OOM
scenario.
—
[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java
]([https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java])
```diff
@@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
import java.io.IOException;
import java.io.InputStream;
+import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
+
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
-
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
@@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable,
PDFontLike, PDVectorFo
private float defaultWidth;
private float averageWidth;
- private final Map<Integer, Float> verticalDisplacementY = new
HashMap<>(); // w1y
- private final Map<Integer, Vector> positionVectors = new HashMap<>();
// v
+ private final Map<Integer, Float> verticalDisplacementY = new HashMap<>();
// w1y (individual entries)
+ private final Map<Integer, Vector> positionVectors = new HashMap<>();
// v (individual entries)
+ // Range-based W2 entries stored as compact primitive arrays to avoid
HashMap boxing overhead.
+ // A single range entry (first..last) replaces thousands of individual
HashMap entries.
+ private int[] vdRangeFirst = new int[0];
+ private int[] vdRangeLast = new int[0];
+ private float[] vdRangeW1y = new float[0];
+ private float[] vdRangeVx = new float[0];
+ private float[] vdRangeVy = new float[0];
private final float[] dw2 = new float[] \{ 880, -1000 };
protected final COSDictionary dict;
@@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable,
PDFontLike, PDVectorFo
{ this.dict = fontDictionary; this.parent = parent; +
Runtime rt = Runtime.getRuntime(); + String fontName =
fontDictionary.getNameAsString(COSName.BASE_FONT); +
System.out.printf("[PDCIDFont] init %-40s free=%6.1fMB used=%6.1fMB%n", +
fontName, + rt.freeMemory() / 1024.0 / 1024.0, +
(rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0); readWidths();
+ System.out.printf("[PDCIDFont] after readWidths %-32s widths.size=%d
free=%6.1fMB%n", + fontName, widths.size(), rt.freeMemory() /
1024.0 / 1024.0); readVerticalDisplacements(); +
System.out.printf("[PDCIDFont] after readVD %-32s indiv=%d ranges=%d
free=%6.1fMB%n", + fontName, verticalDisplacementY.size(),
vdRangeFirst.length, + rt.freeMemory() / 1024.0 / 1024.0); }
private void readWidths()
@@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable,
PDFontLike, PDVectorFo
COSNumber w1y = (COSNumber) w2Array.getObject(++i);
COSNumber v1x = (COSNumber) w2Array.getObject(++i);
COSNumber v1y = (COSNumber) w2Array.getObject(++i);
- for (int cid = first; cid <= last; cid++)
- \{ -
verticalDisplacementY.put(cid, w1y.floatValue()); -
positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue())); -
}+ // Store as a compact range entry instead
of expanding to per-CID HashMap entries.
+ // This avoids allocating thousands of boxed
Integer/Float/Vector objects
+ // when a single range covers many CIDs (e.g. 0..16000).
+ int n = vdRangeFirst.length;
+ vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
+ vdRangeLast = Arrays.copyOf(vdRangeLast, n + 1);
+ vdRangeW1y = Arrays.copyOf(vdRangeW1y, n + 1);
+ vdRangeVx = Arrays.copyOf(vdRangeVx, n + 1);
+ vdRangeVy = Arrays.copyOf(vdRangeVy, n + 1);
+ vdRangeFirst[n] = first;
+ vdRangeLast[n] = last;
+ vdRangeW1y[n] = w1y.floatValue();
+ vdRangeVx[n] = v1x.floatValue();
+ vdRangeVy[n] = v1y.floatValue();
}
}
}
@@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable,
PDFontLike, PDVectorFo
public Vector getPositionVector(int code)
\{ int cid = codeToCID(code); + // Check individual
(array-format) entries first Vector v = positionVectors.get(cid); -
if (v == null) + if (v != null) + \{ + return v;
+ }
+ // Check compact range entries
+ for (int i = 0; i < vdRangeFirst.length; i++)
{ - v = getDefaultPositionVector(cid); + if (cid >=
vdRangeFirst[i] && cid <= vdRangeLast[i]) + \\{ +
return new Vector(vdRangeVx[i], vdRangeVy[i]); + }
}
- return v;
+ return getDefaultPositionVector(cid);
}
/**
@@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable,
PDFontLike, PDVectorFo
public float getVerticalDisplacementVectorY(int code)
\{ int cid = codeToCID(code); + // Check individual
(array-format) entries first Float w1y =
verticalDisplacementY.get(cid); - if (w1y == null) + if (w1y !=
null) \{ - w1y = dw2[1]; + return w1y; +
}
+ // Check compact range entries
+ for (int i = 0; i < vdRangeFirst.length; i++)
+
Unknown macro: {+ if (cid >= vdRangeFirst[i] && cid <=
vdRangeLast[i])+
{ + return vdRangeW1y[i]; + }
}
- return w1y;
+ return dw2[1];
}
@Override
```
To reduce memory consumption, I stopped expanding *large CID ranges in W2
(vertical metrics)* one by one into massive numbers of objects.
Instead, the range information is now stored in *small primitive arrays*.
This avoids creating large numbers of `Integer`, `Float`, and `Vector` objects
and helps prevent *OutOfMemoryError (OOM)*.
* Changes
** Added `Arrays` to the imports (used for array expansion).
** Added new fields:
*** `vdRangeFirst` / `vdRangeLast` (`int[]`)
*** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
*** These arrays store the values for each *range entry (first..last)*.
** Added memory status logging in the constructor:
*** Logs free/used memory *before and after `readWidths`* and *after
`readVerticalDisplacements`* (logged as `readVD`).
** Changes in `readVerticalDisplacements()`:
*** The existing *array format (individual entries)* is handled as before and
stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
*** When a *range format* entry (`first last w1y v1x v1y`) is encountered,
instead of looping over `first..last` and inserting each value into the
`HashMap`s, the code now:
**** Grows the `vdRange*` arrays by one element using `Arrays.copyOf`.
**** Stores the *range as a single entry* in those arrays.
*** Purpose: avoid generating *tens of thousands of objects* for large ranges
(e.g., 16,000 entries).
** Changes in the lookup logic (`getPositionVector` /
`getVerticalDisplacementVectorY`):
1. First check *individual entries* (`positionVectors` /
`verticalDisplacementY`). If found, return that value.
2. Otherwise, *linearly search* the `vdRange*` arrays to see if the CID falls
within a stored range. If a matching range is found, return the corresponding
value.
3. If neither matches, return the *default value*.
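The lookup order listed above can be tried out in isolation. The following is a minimal, self-contained sketch and not the PDFBox code itself: the class `CompactRangeTable` and its method names are hypothetical, and only the `w1y` value is modelled (the position vector would use two more parallel `float[]` arrays in the same way).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of PDFBox.
public class CompactRangeTable
{
    // Individual (array-format) entries, keyed by CID.
    private final Map<Integer, Float> individual = new HashMap<>();
    // Range-format entries: one element per range in each parallel array.
    private int[] rangeFirst = new int[0];
    private int[] rangeLast = new int[0];
    private float[] rangeW1y = new float[0];
    private final float defaultW1y;

    public CompactRangeTable(float defaultW1y)
    {
        this.defaultW1y = defaultW1y;
    }

    public void putIndividual(int cid, float w1y)
    {
        individual.put(cid, w1y);
    }

    // Appends one range; each array grows by exactly one element.
    public void addRange(int first, int last, float w1y)
    {
        int n = rangeFirst.length;
        rangeFirst = Arrays.copyOf(rangeFirst, n + 1);
        rangeLast = Arrays.copyOf(rangeLast, n + 1);
        rangeW1y = Arrays.copyOf(rangeW1y, n + 1);
        rangeFirst[n] = first;
        rangeLast[n] = last;
        rangeW1y[n] = w1y;
    }

    // Individual entries first, then a linear scan of the ranges, then the default.
    public float lookup(int cid)
    {
        Float w1y = individual.get(cid);
        if (w1y != null)
        {
            return w1y;
        }
        for (int i = 0; i < rangeFirst.length; i++)
        {
            if (cid >= rangeFirst[i] && cid <= rangeLast[i])
            {
                return rangeW1y[i];
            }
        }
        return defaultW1y;
    }

    public int rangeCount()
    {
        return rangeFirst.length;
    }

    public static void main(String[] args)
    {
        CompactRangeTable table = new CompactRangeTable(-1000f);
        table.addRange(0, 16000, -900f);   // one range entry, no per-CID objects
        table.putIndividual(42, -500f);
        System.out.println(table.lookup(42));    // served by the individual entry
        System.out.println(table.lookup(7));     // served by the range entry
        System.out.println(table.lookup(20000)); // falls through to the default
    }
}
```

One design consequence worth checking against the intended semantics: with this lookup order an individual entry always wins over a range, and the first matching range wins over later ones, whereas the expand-into-`HashMap` approach lets whichever entry is read last overwrite earlier ones.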
## Why this works (briefly)
* *Before:* Receiving a range like `0..16000` caused the code to create
`Integer`/`Float`/`Vector` objects for each of the *16,001 CIDs* and store them
in the `HashMap`s, leading to large memory overhead.
* *Now:* The same range is stored as *a single range entry*, i.e. two `int` and
three `float` values, or about 20 bytes of array data per range.
—
[pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java)
```diff
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
index 3b970edd24..5b574d8c66 100644
-    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
+    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
/**
* Constructor.
```
* *What was changed*
** The type of `directFontCache` was changed:
*** *Before:* `Map<COSName, SoftReference<PDFont>>`
*** *After:* `Map<COSName, PDFont>`
** As a result, the `SoftReference` import was removed, and the `get`/`put`
logic was rewritten to store and retrieve `PDFont` directly instead of going
through `SoftReference`.
* *Why this was changed (problem description)*
** Previously, `SoftReference` was used so that the JVM could automatically
clear the referenced `PDFont` objects when memory became tight.
** However, when the reference was cleared, the same font would be parsed
again the next time it was requested. Since font parsing is expensive, this
could happen repeatedly, causing excessive memory allocations.
** In some cases this repeated parsing led to unnecessary memory pressure and
eventually an `OutOfMemoryError`.
** By switching to strong references, the cache is no longer cleared
unpredictably by the GC, preventing this repeated parse cycle.
* *Behavior after the change (effect)*
** The same embedded font will no longer be parsed multiple times within the
same page or the same `PDResources`.
** This reduces unnecessary memory allocations and helps prevent
`OutOfMemoryError`.
** Fonts are expected to be released according to the lifecycle of
`PDResources` / `PDDocument`, typically when page processing is finished.
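To make the difference between the two cache types concrete, here is a small stand-alone sketch. It is an illustration under stated assumptions, not PDFBox code: `FontCacheSketch`, `StrongCache`, `SoftCache`, and `simulateMemoryPressure()` are hypothetical names, the "font" is just a placeholder object, and the GC clearing soft references is simulated by calling `SoftReference.clear()` explicitly, because real clearing cannot be triggered deterministically.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-alone sketch; not part of PDFBox.
public class FontCacheSketch
{
    // Cache holding values directly: entries survive until the map itself is dropped.
    public static final class StrongCache<K, V>
    {
        private final Map<K, V> map = new HashMap<>();
        private int loads;

        public V get(K key, Function<K, V> loader)
        {
            V value = map.get(key);
            if (value == null)
            {
                value = loader.apply(key); // stands in for the expensive font parse
                loads++;
                map.put(key, value);
            }
            return value;
        }

        public int loads()
        {
            return loads;
        }
    }

    // Cache holding values via SoftReference: the referent may disappear at any time.
    public static final class SoftCache<K, V>
    {
        private final Map<K, SoftReference<V>> map = new HashMap<>();
        private int loads;

        public V get(K key, Function<K, V> loader)
        {
            SoftReference<V> ref = map.get(key);
            V value = ref != null ? ref.get() : null;
            if (value == null)
            {
                value = loader.apply(key); // re-parse after the reference was cleared
                loads++;
                map.put(key, new SoftReference<>(value));
            }
            return value;
        }

        // Simulates what the GC may do to soft references under memory pressure.
        public void simulateMemoryPressure()
        {
            map.values().forEach(SoftReference::clear);
        }

        public int loads()
        {
            return loads;
        }
    }

    public static void main(String[] args)
    {
        StrongCache<String, Object> strong = new StrongCache<>();
        SoftCache<String, Object> soft = new SoftCache<>();
        for (int i = 0; i < 3; i++)
        {
            strong.get("F1", name -> new Object());
            soft.get("F1", name -> new Object());
            soft.simulateMemoryPressure();
        }
        System.out.println("strong cache loads: " + strong.loads());
        System.out.println("soft cache loads:   " + soft.loads());
    }
}
```

The trade-off is that the strong cache keeps every direct font reachable until the owning map is released, which is why the lifetime of `PDResources` / `PDDocument` matters for this change.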
> OutOfMemoryError parsing large CID fonts: soft-reference font cache cleared +
> W2 range expansion leads to OOM
> -------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6175
> URL: https://issues.apache.org/jira/browse/PDFBOX-6175
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm, FontBox, Parsing, PDModel, Text extraction
> Affects Versions: 3.0.5 PDFBox, 3.0.6 PDFBox, 3.0.7 PDFBox, 3.0.4 JBIG2
> Environment: openjdk 21.0.7 2025-04-15
> OpenJDK Runtime Environment Homebrew (build 21.0.7)
> OpenJDK 64-Bit Server VM Homebrew (build 21.0.7, mixed mode, sharing)
> pdfbox version:3.0.4
> Reporter: Kiyotsuki Suzuki
> Priority: Major
> Labels: performance
> Fix For: 3.0.4 JBIG2
>
> Attachments: 6.pdf
>
>
> When processing PDFs that contain large CID fonts (many CIDs
> and/or wide W2 ranges), PDFBox can run into java.lang.OutOfMemoryError during
> font parsing / text extraction even with modest heap sizes.
> *<Observed symptom>*
> - OOM occurs while creating PDFont / PDCIDFont instances during text
> extraction.
> - Problem appears when fonts are embedded as non-indirect objects and when
> W2 (vertical metrics) contains large ranges (e.g. first..last spanning many
> CIDs).
> * Likely root causes (two cooperating issues)
> 1. directFontCache uses SoftReference<PDFont>. Under memory pressure the JVM
> clears soft references, causing cached font objects to be discarded.
> Subsequent uses re-parse the same (heavy) font repeatedly. This GC ->
> re-parse -> GC cycle can escalate memory usage and trigger OOM.
> 2. W2 range entries (first last w1y v.x v.y) are expanded naively into
> per-CID HashMap entries (boxed Integer/Float and Vector objects). A single
> large range (e.g. 0..16000) causes creation of tens of thousands of objects
> and large HashMap memory overhead, causing immediate heap exhaustion.
> * Suggested fixes (implementation-level guidance)
> -- Avoid relying on SoftReference for the per-resource direct font cache for
> non-indirect embedded fonts. Use strong references scoped to PDResources (or
> make the behavior configurable). PDResources is freed with the document
> lifecycle, so strong references prevent repeated re-parsing without leaking
> across documents.
> -- Do not expand large W2 ranges into individual boxed map entries. Parse
> and store W2 ranges compactly (e.g. range list with primitive arrays or small
> objects representing [first,last,w1y,vx,vy]). At lookup time check ranges (or
> use a compact index). This avoids creating thousands of Integer/Float/Vector
> objects for wide ranges.
> -- Add tests exercising large CID fonts with wide W2 ranges to guard against
> regressions, and add a memory-use test if possible.
> * Why fix should be upstream
> ** This is a parser/runtime efficiency bug that affects robustness for
> real-world PDFs (CJK/CID fonts). Upstream fix avoids repeated re-parsing and
> large allocations across all users.
> —
> *<Example>*
> [/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java](https://github.com/apache/pdfbox/blob/3.0.4/fontbox/src/main/java/org/apache/fontbox/ttf/GlyphTable.java)
> ```diff
> @@ -80,6 +80,11 @@ public class GlyphTable extends TTFTable
>          // we don't actually read the complete table here because it can contain tens of thousands of glyphs
>          // cache the relevant part of the font data so that the data stream can be closed if it is no longer needed
>          byte[] dataBytes = data.read((int) getLength());
> +        Runtime rt = Runtime.getRuntime();
> +        System.out.printf("[GlyphTable] read %6.2f MB for %d glyphs free=%6.1fMB used=%6.1fMB%n",
> +                dataBytes.length / (1024.0 * 1024.0), numGlyphs,
> +                rt.freeMemory() / 1024.0 / 1024.0,
> +                (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          try (RandomAccessReadBuffer read = new RandomAccessReadBuffer(dataBytes))
>          {
>              this.data = new RandomAccessReadDataStream(read);
> ```
> This log prints the read size and the JVM memory state (free/used)
> immediately after the byte array of the GlyphTable is loaded, to show how much
> memory was consumed and whether the heap is becoming constrained. In this
> investigation, it helped confirm that "the heap decreases with each load and
> then recovers after GC," which was useful for tracing the cause of the
> OutOfMemoryError (OOM).
> —
> [/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java)
> ```diff
> @@ -17,7 +17,6 @@ package org.apache.pdfbox.pdmodel;
>  import java.io.IOException;
> -import java.lang.ref.SoftReference;
>  import java.util.Collections;
>  import java.util.HashMap;
>  import java.util.Map;
> @@ -31,15 +30,15 @@ import org.apache.pdfbox.pdmodel.common.COSObjectable;
>  import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>  import org.apache.pdfbox.pdmodel.font.PDFont;
>  import org.apache.pdfbox.pdmodel.font.PDFontFactory;
> +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
>  import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>  import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
> -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
> -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
>  import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
> -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
>  /**
>   * A set of resources available at the page/pages/stream level.
> @@ -54,7 +53,9 @@ public final class PDResources implements COSObjectable
>      // PDFBOX-3442 cache fonts that are not indirect objects, as these aren't cached in ResourceCache
>      // and this would result in huge memory footprint in text extraction
> -    private final Map<COSName, SoftReference<PDFont>> directFontCache;
> +    // NOTE: changed from SoftReference to strong reference to prevent GC clearing under memory pressure
> +    // causing repeated re-parse of large CID fonts (death spiral leading to OOM)
> +    private final Map<COSName, PDFont> directFontCache;
>      /**
>       * Constructor for embedding.
> @@ -107,7 +108,7 @@ public final class PDResources implements COSObjectable
>       * @param directFontCache The document's direct font cache. Must be mutable
>       */
>      public PDResources(COSDictionary resourceDictionary, ResourceCache resourceCache,
> -            Map<COSName, SoftReference<PDFont>> directFontCache)
> +            Map<COSName, PDFont> directFontCache)
>      {
>          if (resourceDictionary == null)
>          {
> @@ -152,14 +153,11 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            SoftReference<PDFont> ref = directFontCache.get(name);
> -            if (ref != null)
> +            System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
> +            PDFont cached = directFontCache.get(name);
> +            if (cached != null)
>              {
> -                PDFont cached = ref.get();
> -                if (cached != null)
> -                {
> -                    return cached;
> -                }
> +                return cached;
>              }
>          }
>
> @@ -176,7 +174,7 @@ public final class PDResources implements COSObjectable
>          }
>          else if (indirect == null)
>          {
> -            directFontCache.put(name, new SoftReference<>(font));
> +            directFontCache.put(name, font);
>          }
>          return font;
>      }
> ```
> ## What Was Changed (Key Code Changes)
> * The `PDResources` field was changed from:
> ** *Before:* `Map<COSName, SoftReference<PDFont>> directFontCache`
> ** *After:* `Map<COSName, PDFont> directFontCache`
> * The constructor parameter type was also updated accordingly:
> `Map<COSName, SoftReference<PDFont>>` → `Map<COSName, PDFont>`.
> * In `getFont(...)`, the retrieval logic was simplified:
> ** *Before:* Retrieved a `SoftReference` and then called `ref.get()` to
> obtain the `PDFont`.
> ** *After:* The `PDFont` is retrieved directly, so `ref.get()` is no longer
> needed.
> * The caching logic was also updated:
> ** *Before:* `directFontCache.put(name, new SoftReference<>(font))`
> ** *After:* `directFontCache.put(name, font)`
> * Additionally, a debug message was added:
> ```java
> System.out.println("Font " + name + " is not an indirect object, caching in directFontCache");
> ```
> —
> Original Problem (Why the Fix Was Necessary)
> The original implementation stored *fonts embedded directly as
> dictionaries* using `SoftReference`.
> `SoftReference` allows the JVM to *automatically clear cached objects
> when memory becomes low*.
> This created a problem:
> 1. When memory pressure occurs, the JVM clears the cached fonts.
> 2. The next time the same font is requested, the system *re-parses the
> font from the PDF*.
> 3. Font parsing can allocate large temporary structures.
> 4. Under memory pressure, this leads to a loop:
> ```
> GC clears font cache
> → font requested again
> → font parsed again
> → large memory allocation
> → GC clears again
> → repeat
> ```
> This *reparse loop* can eventually cause an *OutOfMemoryError (OOM)*.
> The issue becomes particularly severe with *CID fonts (large CJK fonts)*,
> because parsing them creates very large in-memory structures.
> —
> What This Fix Improves
> By switching `directFontCache` to *strong references (`PDFont`)*, the JVM
> can no longer clear the cached fonts automatically.
> This prevents the cycle:
> ```
> memory pressure
> → cache cleared
> → font re-parsed
> → more memory pressure
> ```
> As a result, the system *stops repeatedly parsing the same fonts*,
> preventing unnecessary heap consumption and avoiding the OOM scenario.
> —
> [pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDCIDFont.java)
> ```diff
> @@ -18,11 +18,12 @@ package org.apache.pdfbox.pdmodel.font;
>
> import java.io.IOException;
> import java.io.InputStream;
> +import java.util.Arrays;
> import java.util.HashMap;
> import java.util.Map;
> +
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> -
> import org.apache.pdfbox.cos.COSArray;
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSDictionary;
> @@ -51,8 +52,15 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      private float defaultWidth;
>      private float averageWidth;
>
> -    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y
> -    private final Map<Integer, Vector> positionVectors = new HashMap<>(); // v
> +    private final Map<Integer, Float> verticalDisplacementY = new HashMap<>(); // w1y (individual entries)
> +    private final Map<Integer, Vector> positionVectors = new HashMap<>(); // v (individual entries)
> +    // Range-based W2 entries stored as compact primitive arrays to avoid HashMap boxing overhead.
> +    // A single range entry (first..last) replaces thousands of individual HashMap entries.
> +    private int[] vdRangeFirst = new int[0];
> +    private int[] vdRangeLast = new int[0];
> +    private float[] vdRangeW1y = new float[0];
> +    private float[] vdRangeVx = new float[0];
> +    private float[] vdRangeVy = new float[0];
>      private final float[] dw2 = new float[] { 880, -1000 };
>
> protected final COSDictionary dict;
> @@ -67,8 +75,19 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      {
>          this.dict = fontDictionary;
>          this.parent = parent;
> +        Runtime rt = Runtime.getRuntime();
> +        String fontName = fontDictionary.getNameAsString(COSName.BASE_FONT);
> +        System.out.printf("[PDCIDFont] init %-40s free=%6.1fMB used=%6.1fMB%n",
> +                fontName,
> +                rt.freeMemory() / 1024.0 / 1024.0,
> +                (rt.totalMemory() - rt.freeMemory()) / 1024.0 / 1024.0);
>          readWidths();
> +        System.out.printf("[PDCIDFont] after readWidths %-32s widths.size=%d free=%6.1fMB%n",
> +                fontName, widths.size(), rt.freeMemory() / 1024.0 / 1024.0);
>          readVerticalDisplacements();
> +        System.out.printf("[PDCIDFont] after readVD %-32s indiv=%d ranges=%d free=%6.1fMB%n",
> +                fontName, verticalDisplacementY.size(), vdRangeFirst.length,
> +                rt.freeMemory() / 1024.0 / 1024.0);
>      }
>
> private void readWidths()
> @@ -180,11 +199,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>                  COSNumber w1y = (COSNumber) w2Array.getObject(++i);
>                  COSNumber v1x = (COSNumber) w2Array.getObject(++i);
>                  COSNumber v1y = (COSNumber) w2Array.getObject(++i);
> -                for (int cid = first; cid <= last; cid++)
> -                {
> -                    verticalDisplacementY.put(cid, w1y.floatValue());
> -                    positionVectors.put(cid, new Vector(v1x.floatValue(), v1y.floatValue()));
> -                }
> +                // Store as a compact range entry instead of expanding to per-CID HashMap entries.
> +                // This avoids allocating thousands of boxed Integer/Float/Vector objects
> +                // when a single range covers many CIDs (e.g. 0..16000).
> +                int n = vdRangeFirst.length;
> +                vdRangeFirst = Arrays.copyOf(vdRangeFirst, n + 1);
> +                vdRangeLast = Arrays.copyOf(vdRangeLast, n + 1);
> +                vdRangeW1y = Arrays.copyOf(vdRangeW1y, n + 1);
> +                vdRangeVx = Arrays.copyOf(vdRangeVx, n + 1);
> +                vdRangeVy = Arrays.copyOf(vdRangeVy, n + 1);
> +                vdRangeFirst[n] = first;
> +                vdRangeLast[n] = last;
> +                vdRangeW1y[n] = w1y.floatValue();
> +                vdRangeVx[n] = v1x.floatValue();
> +                vdRangeVy[n] = v1y.floatValue();
>              }
>          }
>      }
> @@ -288,12 +317,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      public Vector getPositionVector(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Vector v = positionVectors.get(cid);
> -        if (v == null)
> +        if (v != null)
> +        {
> +            return v;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
>          {
> -            v = getDefaultPositionVector(cid);
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return new Vector(vdRangeVx[i], vdRangeVy[i]);
> +            }
>          }
> -        return v;
> +        return getDefaultPositionVector(cid);
>      }
>
> /**
> @@ -305,12 +343,21 @@ public abstract class PDCIDFont implements COSObjectable, PDFontLike, PDVectorFo
>      public float getVerticalDisplacementVectorY(int code)
>      {
>          int cid = codeToCID(code);
> +        // Check individual (array-format) entries first
>          Float w1y = verticalDisplacementY.get(cid);
> -        if (w1y == null)
> +        if (w1y != null)
>          {
> -            w1y = dw2[1];
> +            return w1y;
> +        }
> +        // Check compact range entries
> +        for (int i = 0; i < vdRangeFirst.length; i++)
> +        {
> +            if (cid >= vdRangeFirst[i] && cid <= vdRangeLast[i])
> +            {
> +                return vdRangeW1y[i];
> +            }
>          }
> -        return w1y;
> +        return dw2[1];
>      }
>
> @Override
> ```
> To reduce memory consumption, I stopped expanding *large CID ranges in W2
> (vertical metrics)* one by one into massive numbers of objects.
> Instead, the range information is now stored in *small primitive arrays*.
> This avoids creating large numbers of `Integer`, `Float`, and `Vector` objects
> and helps prevent *OutOfMemoryError (OOM)*.
> * Changes
> ** Added `Arrays` to the imports (used for array expansion).
> ** Added new fields:
> *** `vdRangeFirst` / `vdRangeLast` (`int[]`)
> *** `vdRangeW1y` / `vdRangeVx` / `vdRangeVy` (`float[]`)
> *** These arrays store the values for each *range entry (first..last)*.
> ** Added memory status logging in the constructor:
> *** Logs free/used memory *before and after `readWidths`* and *after
> `readVerticalDisplacements`* (logged as `readVD`).
> ** Changes in `readVerticalDisplacements()`:
> *** The existing *array format (individual entries)* is handled as before and
> stored in `HashMap`s (`verticalDisplacementY`, `positionVectors`).
> *** When a *range format* entry (`first last w1y v1x v1y`) is encountered,
> instead of looping over `first..last` and inserting each value into the
> `HashMap`s, the code now:
> **** Grows the `vdRange*` arrays by one element using `Arrays.copyOf`.
> **** Stores the *range as a single entry* in those arrays.
> *** Purpose: avoid generating *tens of thousands of objects* for large ranges
> (e.g., 16,000 entries).
> ** Changes in the lookup logic (`getPositionVector` /
> `getVerticalDisplacementVectorY`):
> 1. First check *individual entries* (`positionVectors` /
> `verticalDisplacementY`). If found, return that value.
> 2. Otherwise, *linearly search* the `vdRange*` arrays to see if the CID falls
> within a stored range. If a matching range is found, return the corresponding
> value.
> 3. If neither matches, return the *default value*.
> ## Why this works (briefly)
> * *Before:* Receiving a range like `0..16000` caused the code to create
> `Integer`/`Float`/`Vector` objects for each of the *16,001 CIDs* and store
> them in the `HashMap`s, leading to large memory overhead.
> * *Now:* The same range is stored as *a single range entry*, i.e. two `int`
> and three `float` values, or about 20 bytes of array data per range.
> —
> [pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java](https://github.com/apache/pdfbox/blob/3.0.4/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java)
> ```diff
> diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAcroForm.java
> index 3b970edd24..5b574d8c66 100644
> -    private final Map<COSName, SoftReference<PDFont>> directFontCache = new HashMap<>();
> +    private final Map<COSName, PDFont> directFontCache = new HashMap<>();
>
> /**
> * Constructor.
> ```
> * *What was changed*
> ** The type of `directFontCache` was changed:
> *** *Before:* `Map<COSName, SoftReference<PDFont>>`
> *** *After:* `Map<COSName, PDFont>`
> ** As a result, the `SoftReference` import was removed, and the `get`/`put`
> logic was rewritten to store and retrieve `PDFont` directly instead of going
> through `SoftReference`.
> * *Why this was changed (problem description)*
> ** Previously, `SoftReference` was used so that the JVM could automatically
> clear the referenced `PDFont` objects when memory became tight.
> ** However, when the reference was cleared, the same font would be parsed
> again the next time it was requested. Since font parsing is expensive, this
> could happen repeatedly, causing excessive memory allocations.
> ** In some cases this repeated parsing led to unnecessary memory pressure
> and eventually an `OutOfMemoryError`.
> ** By switching to strong references, the cache is no longer cleared
> unpredictably by the GC, preventing this repeated parse cycle.
> * *Behavior after the change (effect)*
> ** The same embedded font will no longer be parsed multiple times within the
> same page or the same `PDResources`.
> ** This reduces unnecessary memory allocations and helps prevent
> `OutOfMemoryError`.
> ** Fonts are expected to be released according to the lifecycle of
> `PDResources` / `PDDocument`, typically when page processing is finished.