[
https://issues.apache.org/jira/browse/PDFBOX-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
james updated PDFBOX-6146:
--------------------------
Description:
I have a PDF file that causes an OutOfMemoryError when trying to extract the
text. Unfortunately, it is a customer file that I cannot share (it is only
about 5 MB). I am willing, however, to work with someone on debugging the
issue. Before it fails with the OOME, I get the following errors:
{{16:45:23.418 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable -
scriptOffsets[1680]: 10084 implausible: data.getCurrentPosition() - offset =
10088 ()}}
{{16:45:23.419 [main] WARN o.a.f.ttf.GlyphSubstitutionTable - FeatureRecord
array not alphabetically sorted by FeatureTag: S»d» < »d»c ()}}
The above output is from running on 3.0.6, which seems to have improved
something relevant. This was originally failing on 3.0.5, where I got
around 40k log entries like:
{{16:51:52.897 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - LangSysRecords
not alphabetically sorted by LangSys tag: PH'd <= PLS» ()}}
I've tried allocating up to 6 GB to the extraction process without any success.
I can open the PDF file with macOS Preview without any issues.
The relevant stack trace is:
{{java.lang.OutOfMemoryError: Java heap space
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:341)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:292)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:115)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:409)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:186)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:165)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:123)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:72)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:385)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:97)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:173)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:926)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:559)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:517)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)}}
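
The trace ends inside GSUB table parsing, and the earlier log lines report implausible offsets, so one plausible cause (unconfirmed, not checked against the FontBox source) is that a count or offset read from corrupt font data is used to size an allocation. The sketch below is a hypothetical, standalone illustration of the general defensive technique: compare a count read from the data with the bytes actually remaining before allocating. The class name `BoundedTableReader`, the method `readOffsets`, and the table layout (a uint16 count followed by uint16 offsets) are assumptions made for illustration and are not FontBox APIs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical illustration only; this is not FontBox code.
public class BoundedTableReader {

    /**
     * Reads a uint16 count followed by that many uint16 offsets,
     * rejecting counts that cannot fit in the remaining bytes.
     */
    public static int[] readOffsets(ByteBuffer data) throws IOException {
        if (data.remaining() < 2) {
            throw new IOException("truncated table: no count field");
        }
        int count = data.getShort() & 0xFFFF; // treat the 16-bit value as unsigned
        // Each offset occupies 2 bytes, so a plausible count is bounded
        // by the number of bytes left in the buffer.
        if (count > data.remaining() / 2) {
            throw new IOException("implausible count " + count
                    + " for " + data.remaining() + " remaining bytes");
        }
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = data.getShort() & 0xFFFF;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Well-formed input: count = 2, offsets 4 and 8.
        byte[] good = {0, 2, 0, 4, 0, 8};
        // Corrupt input: count = 65535 but only 2 bytes of payload follow.
        byte[] bad = {(byte) 0xFF, (byte) 0xFF, 0, 4};
        for (byte[] table : new byte[][] {good, bad}) {
            try {
                int[] offsets = readOffsets(ByteBuffer.wrap(table));
                System.out.println("read " + offsets.length + " offsets");
            } catch (IOException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }
}
```

A check like this fails fast on corrupt data instead of allocating based on an untrusted value. Whether the actual parser needs such a check at `readLookupTable` can only be confirmed by debugging with the affected file.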
> OutOfMemoryError when trying to extract text from pdf
> -----------------------------------------------------
>
> Key: PDFBOX-6146
> URL: https://issues.apache.org/jira/browse/PDFBOX-6146
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.6 PDFBox
> Environment: Java 17, macOS 26.2
> Reporter: james
> Priority: Blocker
>