[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137866#comment-17137866 ] Hudson commented on TIKA-3111: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #341 (See [https://builds.apache.org/job/tika-branch-1x/341/]) TIKA-3111 -- upgrade to PDFBox 2.0.20 -- need to understand (tallison: [https://github.com/apache/tika/commit/2b10d9c6ebf434fc4c57499acb591fb7226fee7d]) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java * (edit) tika-parsers/pom.xml > Upgrade to PDFBox 2.0.20 > > > Key: TIKA-3111 > URL: https://issues.apache.org/jira/browse/TIKA-3111 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135405#comment-17135405 ] Andreas Lehmkühler commented on TIKA-3111: -- Thanks for the prompt feedback [~tilman]
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135203#comment-17135203 ] Tilman Hausherr commented on TIKA-3111: --- Now it works
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135145#comment-17135145 ] Andreas Lehmkühler commented on TIKA-3111: -- I've extended my patch and taken LegacyPDFStreamEngine into account as well. The 4-parameter method now calls the deprecated 5-parameter method, passing a valid unicode value; see PDFBOX-4879.
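The delegation described in this comment can be sketched with a toy model. This is a minimal sketch under stated assumptions, not the actual PDFBOX-4879 patch: `Engine`, `LegacyCompatEngine`, `CountingHandler`, and `toUnicode` are hypothetical stand-ins, and the parameter types are simplified to strings and numbers. In particular, the sketch passes `null` for codes without a mapping; what the real patch passes in that case is not stated in this thread.

```java
class CompatDemo {
    // Returns {total characters, unmapped characters} seen by the handler.
    static int[] counts() {
        CountingHandler handler = new CountingHandler();
        handler.processText(new int[] {72, 13, 105});
        return new int[] {handler.totalChars, handler.unmappedChars};
    }

    public static void main(String[] args) {
        int[] c = counts();
        System.out.println("total=" + c[0] + " unmapped=" + c[1]);
    }
}

// Hypothetical stand-in for the base engine.
abstract class Engine {
    // New 4-parameter hook; the engine calls only this one.
    protected void showGlyph(String matrix, String font, int code, double displacement) {
    }

    // Deprecated 5-parameter hook that existing subclasses override.
    @Deprecated
    protected void showGlyph(String matrix, String font, int code, String unicode,
                             double displacement) {
    }

    final void processText(int[] codes) {
        for (int code : codes) {
            showGlyph("Tm", "F1", code, 1.0);
        }
    }
}

// Hypothetical stand-in for the legacy engine after the fix.
class LegacyCompatEngine extends Engine {
    // Assumed lookup: printable ASCII maps to itself, anything else is unmapped.
    static String toUnicode(int code) {
        return (code >= 32 && code < 127) ? String.valueOf((char) code) : null;
    }

    @Override
    @SuppressWarnings("deprecation")
    protected void showGlyph(String matrix, String font, int code, double displacement) {
        // Forward to the deprecated overload so overrides of it are still invoked.
        showGlyph(matrix, font, code, toUnicode(code), displacement);
    }
}

// Hypothetical subclass that only knows about the deprecated overload.
class CountingHandler extends LegacyCompatEngine {
    int totalChars = 0;
    int unmappedChars = 0;

    @Override
    @Deprecated
    protected void showGlyph(String matrix, String font, int code, String unicode,
                             double displacement) {
        super.showGlyph(matrix, font, code, unicode, displacement);
        if (unicode == null || unicode.isEmpty()) {
            unmappedChars++;
        }
        totalChars++;
    }
}
```

In this sketch the handler sees all three glyphs and one unmapped code, because the 4-parameter method forwards to the overload the subclass overrides.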
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134780#comment-17134780 ] Andreas Lehmkühler commented on TIKA-3111: -- Thanks for the fast feedback, and sorry for the inconvenience. I'm going back to work on that issue. > Attachments: Patch_PDFStreamEngine.txt
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134768#comment-17134768 ] Tilman Hausherr commented on TIKA-3111: --- I did (after reverting my change in Tika), and it doesn't work: {{showGlyph(textRenderingMatrix, font, code, w);}} calls the method in {{LegacyPDFStreamEngine}}. > Attachments: Patch_PDFStreamEngine.txt
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134727#comment-17134727 ] Andreas Lehmkühler commented on TIKA-3111: -- I guess I've reinstated binary compatibility; see the attached patch. [~tilman] Are you able to double-check the changes? > Attachments: Patch_PDFStreamEngine.txt
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134345#comment-17134345 ] Andreas Lehmkühler commented on TIKA-3111: -- [~tilman] Yes, you're right, the contract is broken, my bad. I'm afraid we need to repair that. I'm going to have a look.
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134303#comment-17134303 ] Tilman Hausherr commented on TIKA-3111: --- No, I got it to work with several changes in AbstractPDF2XHTML, i.e. use the 4-parameter call and get the unicode myself. [~lehmi] WDYT of this? IMHO the contract of the deprecated showGlyph() has been broken because unicode is now null when it is called.
{code}
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement) throws IOException {
    // Look up the unicode value here, since the 4-parameter call no longer supplies it.
    String unicode = font.toUnicode(code);
    super.showGlyph(textRenderingMatrix, font, code, displacement);
    // Count glyphs that have no unicode mapping.
    if (unicode == null || unicode.isEmpty()) {
        unmappedUnicodeCharsPerPage++;
    }
    totalCharsPerPage++;
}
{code}
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134289#comment-17134289 ] Tim Allison commented on TIKA-3111: --- Thank you! So, should we switch from LegacyPDFStreamEngine to PDFStreamEngine in Tika?
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134275#comment-17134275 ] Tilman Hausherr commented on TIKA-3111: --- Got it. PDFStreamEngine calls the (new) 4-parameter showGlyph. But it is not the Tika showGlyph() that gets called; the one from LegacyPDFStreamEngine is called instead, so AbstractPDF2XHTML loses.
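The dispatch problem described in this comment can be reproduced in a toy model. This is a minimal sketch, not PDFBox or Tika source: `BaseEngine`, `LegacyEngine`, and `Xhtml` are hypothetical stand-ins for PDFStreamEngine, LegacyPDFStreamEngine, and AbstractPDF2XHTML, with parameter types simplified to strings and numbers. It assumes, as the comment says, that the engine calls only the 4-parameter showGlyph and that the legacy class overrides that overload without forwarding to the deprecated 5-parameter one.

```java
class DispatchDemo {
    // Returns {glyphs handled by the legacy class, glyphs counted by the subclass}.
    static int[] run() {
        Xhtml handler = new Xhtml();
        handler.processText(new int[] {72, 105});
        return new int[] {handler.positionsBuilt, handler.totalCharsPerPage};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("positions=" + r[0] + " counted=" + r[1]);
    }
}

// Hypothetical stand-in for PDFStreamEngine.
abstract class BaseEngine {
    // New 4-parameter hook, called for every glyph.
    protected void showGlyph(String matrix, String font, int code, double displacement) {
    }

    // Deprecated 5-parameter hook that older subclasses override.
    @Deprecated
    protected void showGlyph(String matrix, String font, int code, String unicode,
                             double displacement) {
    }

    final void processText(int[] codes) {
        for (int code : codes) {
            showGlyph("Tm", "F1", code, 1.0); // only the 4-parameter overload is used
        }
    }
}

// Hypothetical stand-in for LegacyPDFStreamEngine.
class LegacyEngine extends BaseEngine {
    int positionsBuilt = 0;

    @Override
    protected void showGlyph(String matrix, String font, int code, double displacement) {
        positionsBuilt++; // does its own work and never reaches the 5-parameter hook
    }
}

// Hypothetical stand-in for AbstractPDF2XHTML.
class Xhtml extends LegacyEngine {
    int totalCharsPerPage = 0;

    @Override
    @Deprecated
    protected void showGlyph(String matrix, String font, int code, String unicode,
                             double displacement) {
        super.showGlyph(matrix, font, code, unicode, displacement);
        totalCharsPerPage++; // the counting lives in the deprecated override
    }
}
```

Running `main` prints `positions=2 counted=0`: the legacy class handles both glyphs, while the subclass override of the deprecated overload is never reached, so its counter stays at zero.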
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134264#comment-17134264 ] Tilman Hausherr commented on TIKA-3111: --- Ignore my comment, it isn't helpful here; I was just displaying with PDFDebugger. Yes, text extraction is fine. I'm researching something else now. PDFBox has changed the API of {{showGlyph}}. It should have been backward compatible, but as you mentioned, {{AbstractPDF2XHTML.showGlyph}} isn't called, which is very suspicious.
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134100#comment-17134100 ] Tim Allison commented on TIKA-3111: --- Sorry, to clarify, we don’t get character counts for _any_ pages in that file now.
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134097#comment-17134097 ] Tim Allison commented on TIKA-3111: --- Not sure I follow. Text extraction seems to be the same (on a quick look), and I recognize the file is broken. However, we used to get character counts for all of the pages, and we don’t now. Oddly, this happens when I build on the command line but not in IntelliJ. If this is expected, is there a way we can get the character counts and unmapped character counts?
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133978#comment-17133978 ] Tilman Hausherr commented on TIKA-3111: ---
Tail of debug log for 2.0.19:
{quote}
Warning [PDType1Font] Using fallback font TimesNewRomanPSMT for MIonic
Error [PDFStreamEngine] Operator Tm has too few operands: [COSInt{7}, COSInt{0}, COSInt{0}, COSInt{7}, COSFloat{1324270.4}]
Warning [BaseParser] Corrupt object reference at offset 1568, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1575, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1580, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1584, start offset: 1561
Error [PDFStreamEngine] Operator Tm has too few operands: [COSInt{8}, COSInt{0}, COSInt{0}, COSInt{8}, COSFloat{132411.4}]
Warning [BaseParser] Corrupt object reference at offset 1687, start offset: 1636
Warning [PDType1Font] Using fallback font TimesNewRomanPS-ItalicMT for MIonic-Italic
Warning [PDSimpleFont] No Unicode mapping for .notdef (13) in font MIonic
Warning [Type1Glyph2D] No glyph for code 13 (.notdef) in font MIonic
Warning [PDSimpleFont] No Unicode mapping for .notdef (10) in font MIonic
Warning [Type1Glyph2D] No glyph for code 10 (.notdef) in font MIonic
Warning [PDSimpleFont] No Unicode mapping for emlowln (108) in font T-1
{quote}
Tail of debug log for 2.0.20:
{quote}
Warning [PDType1Font] Using fallback font TimesNewRomanPSMT for MIonic
Error [PDFStreamEngine] Operator Tm has too few operands: [COSInt{7}, COSInt{0}, COSInt{0}, COSInt{7}, COSFloat{1324270.4}]
Warning [BaseParser] Corrupt object reference at offset 1568, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1575, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1580, start offset: 1561
Warning [BaseParser] Corrupt object reference at offset 1584, start offset: 1561
Error [PDFStreamEngine] Operator Tm has too few operands: [COSInt{8}, COSInt{0}, COSInt{0}, COSInt{8}, COSFloat{132411.4}]
Warning [BaseParser] Corrupt object reference at offset 1687, start offset: 1636
Warning [PDType1Font] Using fallback font TimesNewRomanPS-ItalicMT for MIonic-Italic
Warning [Type1Glyph2D] No glyph for code 13 (.notdef) in font MIonic
Warning [Type1Glyph2D] No glyph for code 10 (.notdef) in font MIonic
{quote}
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133743#comment-17133743 ] Hudson commented on TIKA-3111: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1821 (See [https://builds.apache.org/job/Tika-trunk/1821/]) TIKA-3111 -- upgrade to PDFBox 2.0.20 -- need to understand (tallison: [https://github.com/apache/tika/commit/81bbd8b307b776e61bbe997e8bf6bd1bd1cedb13]) * (edit) tika-parsers/pom.xml * (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133677#comment-17133677 ] Tim Allison commented on TIKA-3111: --- I made the upgrade in master, but came across a weird failure: https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1502 The unit test passed in my IDE, but caused the build to fail on the command line -- {{mvn clean install}}. It looks like showGlyph() is never being called when I try to build from the command line.