james created PDFBOX-6146:
-----------------------------

             Summary: OutOfMemoryError when trying to extract text from pdf
                 Key: PDFBOX-6146
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6146
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 3.0.6 PDFBox
         Environment: Java 17, macOS 26.2
            Reporter: james


I have a PDF file that causes an OutOfMemoryError when I try to extract its 
text.  Unfortunately, it is a customer file which I cannot share.  I am 
willing, however, to work with someone on debugging the issue.  Before it fails 
with the OOME, I get the following log messages:

{{16:45:23.418 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - scriptOffsets[1680]: 10084 implausible: data.getCurrentPosition() - offset = 10088 ()}}
{{16:45:23.419 [main] WARN  o.a.f.ttf.GlyphSubstitutionTable - FeatureRecord array not alphabetically sorted by FeatureTag: S»d» < »d»c ()}}
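
For readers unfamiliar with the second message: it reports that a feature tag in the font's GSUB table sorts before the tag preceding it. The sketch below only illustrates that kind of ordering check on plain strings. It is not FontBox's implementation, and the class name, method name, and sample tags are all hypothetical.

```java
import java.util.List;

// Hypothetical illustration of an "alphabetically sorted by tag" check.
// Not taken from FontBox; names and sample data are assumptions.
class FeatureTagOrder {

    // Returns the index of the first tag that sorts before its predecessor,
    // or -1 if the whole list is in ascending order.
    static int firstOutOfOrder(List<String> tags) {
        for (int i = 1; i < tags.size(); i++) {
            if (tags.get(i).compareTo(tags.get(i - 1)) < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("aalt", "kern", "liga");
        List<String> unsorted = List.of("aalt", "liga", "kern");
        System.out.println("sorted list, first violation at:   " + firstOutOfOrder(sorted));
        System.out.println("unsorted list, first violation at: " + firstOutOfOrder(unsorted));
    }
}
```

Tags containing non-printable bytes, like the ones in the warning above, would suggest the table data itself is damaged or is being read from the wrong offset, though the log alone does not confirm that.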

I've tried allocating up to 6 GB of heap to the extraction process without any 
success.  I can open the PDF file in macOS Preview without any issues.
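
When raising the heap limit, it can be worth confirming that the setting actually reaches the JVM that runs the extraction. The sketch below prints the maximum heap the current JVM will use. The class name is hypothetical, and it assumes the limit is passed with the standard -Xmx option (for example -Xmx6g); the report does not say how the memory was allocated.

```java
// Hypothetical helper: reports the maximum heap available to this JVM.
// Assumes the heap limit is set with the standard -Xmx option.
class HeapCheck {

    // Maximum heap in MiB, as reported by the running JVM.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same options as the extraction process (for example java -Xmx6g HeapCheck) should report a value close to the requested limit. A much smaller value would mean the option is not being applied to that process.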

The relevant stack trace is:

{{
java.lang.OutOfMemoryError: Java heap space
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:341)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:292)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:115)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:409)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:186)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:165)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:123)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:72)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:385)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:97)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:173)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:926)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:559)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:517)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)
}}

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
