[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417510#comment-16417510 ]

Hudson commented on TIKA-2579:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1459 (See [https://builds.apache.org/job/Tika-trunk/1459/])
TIKA-2579 and TIKA-2607: Upgrade PDFBox to 2.0.9 and include new (tallison: [https://github.com/apache/tika/commit/ee9e4f445dc8801fe69b5d7702c27aecbf9a6efd])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) CHANGES.txt
* (edit) tika-parsers/pom.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java


> Update to PDFBox 2.0.9 when available
> -------------------------------------
>
>                 Key: TIKA-2579
>                 URL: https://issues.apache.org/jira/browse/TIKA-2579
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: David Pilato
>            Assignee: Tim Allison
>            Priority: Major
>
> Hey team,
>  
> We got this report in the Elasticsearch ingest-attachment project: 
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically, when a font is not available, PDFBox throws an exception like:
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>     at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>     at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>     at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>     at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>     at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>     at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>     at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with 
> https://issues.apache.org/jira/browse/PDFBOX-3997, which is available in 
> PDFBox 2.0.9, but Tika 1.17 is still using 2.0.8. See the related issue 
> https://issues.apache.org/jira/browse/PDFBOX-3985. It is unclear whether that 
> will actually fix the reported problem, but upgrading to PDFBox 2.0.9 could 
> be useful.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
