[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137866#comment-17137866
 ] 

Hudson commented on TIKA-3111:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #341 (See 
[https://builds.apache.org/job/tika-branch-1x/341/])
TIKA-3111 -- upgrade to PDFBox 2.0.20 -- need to understand (tallison: 
[https://github.com/apache/tika/commit/2b10d9c6ebf434fc4c57499acb591fb7226fee7d])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parsers/pom.xml


> Upgrade to PDFBox 2.0.20
> 
>
> Key: TIKA-3111
> URL: https://issues.apache.org/jira/browse/TIKA-3111
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135405#comment-17135405
 ] 

Andreas Lehmkühler commented on TIKA-3111:
--

Thanks for the prompt feedback [~tilman]



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135203#comment-17135203
 ] 

Tilman Hausherr commented on TIKA-3111:
---

Now it works



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135145#comment-17135145
 ] 

Andreas Lehmkühler commented on TIKA-3111:
--

I've extended my patch and taken LegacyPDFStreamEngine into account as well. 
The 4-parameter method now calls the deprecated 5-parameter method and passes 
it a valid unicode value; see PDFBOX-4879.
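
The delegation described above can be sketched roughly as follows. This is a 
simplified model under stated assumptions, not the actual PDFBox patch: the 
class names (GlyphFont, PatchedEngine, CountingSubclass) are hypothetical, and 
the Matrix and Vector parameters are reduced to plain strings. It only shows 
why forwarding from the 4-parameter overload to the deprecated 5-parameter 
overload lets existing subclass overrides fire again.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a font's code-to-Unicode table.
class GlyphFont {
    private final Map<Integer, String> table = new HashMap<>();

    GlyphFont map(int code, String unicode) {
        table.put(code, unicode);
        return this;
    }

    String toUnicode(int code) {
        return table.get(code); // null when the code has no mapping
    }
}

// Hypothetical stand-in for the patched legacy engine.
class PatchedEngine {
    // The engine itself only ever invokes the new 4-parameter overload.
    void processGlyph(GlyphFont font, int code) {
        showGlyph("matrix", font, code, "displacement");
    }

    // New overload: looks up the unicode value and forwards to the deprecated
    // overload, so subclasses that override only the old method still run.
    protected void showGlyph(String matrix, GlyphFont font, int code,
                             String displacement) {
        String unicode = font.toUnicode(code);
        showGlyph(matrix, font, code, unicode, displacement);
    }

    // Deprecated 5-parameter overload, retained as the extension point.
    @Deprecated
    protected void showGlyph(String matrix, GlyphFont font, int code,
                             String unicode, String displacement) {
        // no-op in this sketch
    }
}

// Hypothetical client that overrides only the deprecated overload.
class CountingSubclass extends PatchedEngine {
    int totalChars = 0;
    int unmappedChars = 0;

    @Override
    @Deprecated
    protected void showGlyph(String matrix, GlyphFont font, int code,
                             String unicode, String displacement) {
        totalChars++;
        if (unicode == null || unicode.isEmpty()) {
            unmappedChars++;
        }
    }
}
```

With this shape, a client written against the old 5-parameter signature keeps 
receiving one call per glyph without being recompiled against a new API.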



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134780#comment-17134780
 ] 

Andreas Lehmkühler commented on TIKA-3111:
--

Thanks for the fast feedback, and sorry for the inconvenience. I'm going back 
to work on that issue.

> Upgrade to PDFBox 2.0.20
> 
>
> Key: TIKA-3111
> URL: https://issues.apache.org/jira/browse/TIKA-3111
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: Patch_PDFStreamEngine.txt
>
>






[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134768#comment-17134768
 ] 

Tilman Hausherr commented on TIKA-3111:
---

I did (after reverting my change in Tika), and it doesn't work: 
{{showGlyph(textRenderingMatrix, font, code, w);}} calls the method in 
{{LegacyPDFStreamEngine}}.



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134727#comment-17134727
 ] 

Andreas Lehmkühler commented on TIKA-3111:
--

I guess I've reinstated binary compatibility, see the attached patch.

[~tilman] Are you able to double-check the changes?



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134345#comment-17134345
 ] 

Andreas Lehmkühler commented on TIKA-3111:
--

[~tilman] Yes, you're right, the contract is broken, my bad. I'm afraid we 
need to repair that. I'm going to have a look.



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134303#comment-17134303
 ] 

Tilman Hausherr commented on TIKA-3111:
---

No, I got it to work with several changes in AbstractPDF2XHTML, i.e. use the 
4-parameter call and get the unicode myself.

[~lehmi] WDYT of this? IMHO the contract of the deprecated showGlyph() has been 
broken, because unicode is now null when it is called.

{code}
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code,
        Vector displacement) throws IOException
{
    String unicode = font.toUnicode(code);
    super.showGlyph(textRenderingMatrix, font, code, displacement);
    if (unicode == null || unicode.isEmpty()) {
        unmappedUnicodeCharsPerPage++;
    }
    totalCharsPerPage++;
}
{code}




[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134289#comment-17134289
 ] 

Tim Allison commented on TIKA-3111:
---

Thank you! So, should we switch from LegacyPDFStreamEngine to PDFStreamEngine 
in Tika?



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134275#comment-17134275
 ] 

Tilman Hausherr commented on TIKA-3111:
---

Got it. PDFStreamEngine calls the (new) 4-parameter showGlyph. But it is not 
Tika's showGlyph() that gets called; the one from LegacyPDFStreamEngine is 
called, so AbstractPDF2XHTML loses out.
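
The dispatch problem can be reproduced with a small model. This is only a 
sketch under stated assumptions: the classes BaseEngine, LegacyEngine and 
CountingExtractor are hypothetical stand-ins for PDFStreamEngine, 
LegacyPDFStreamEngine and AbstractPDF2XHTML, and the Matrix, PDFont and Vector 
parameters are reduced to strings. It is not the PDFBox source.

```java
// Stand-in for PDFStreamEngine: the engine invokes only the new
// 4-parameter overload.
class BaseEngine {
    void processGlyph(int code) {
        showGlyph("matrix", "font", code, "displacement");
    }

    // New 4-parameter overload.
    protected void showGlyph(String matrix, String font, int code,
                             String displacement) {
        // no-op in the base engine
    }
}

// Stand-in for LegacyPDFStreamEngine: overrides the 4-parameter overload and
// keeps the deprecated 5-parameter one, but never forwards to it.
class LegacyEngine extends BaseEngine {
    int legacyCalls = 0;

    @Override
    protected void showGlyph(String matrix, String font, int code,
                             String displacement) {
        legacyCalls++;
    }

    @Deprecated
    protected void showGlyph(String matrix, String font, int code,
                             String unicode, String displacement) {
        // retained for compatibility, but nothing calls it in this model
    }
}

// Stand-in for AbstractPDF2XHTML: overrides only the deprecated overload.
class CountingExtractor extends LegacyEngine {
    int totalChars = 0;

    @Override
    @Deprecated
    protected void showGlyph(String matrix, String font, int code,
                             String unicode, String displacement) {
        totalChars++;
    }
}

class DispatchDemo {
    public static void main(String[] args) {
        CountingExtractor extractor = new CountingExtractor();
        extractor.processGlyph(65);
        // The legacy override runs; the subclass override does not.
        System.out.println("legacy calls:  " + extractor.legacyCalls);
        System.out.println("counted chars: " + extractor.totalChars);
    }
}
```

In this model the subclass still compiles and links, yet its counter stays at 
zero, because Java selects the overload at compile time and nothing in the 
call chain refers to the 5-parameter signature any more.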



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134264#comment-17134264
 ] 

Tilman Hausherr commented on TIKA-3111:
---

Ignore my comment, it isn't helpful here, I was just displaying with 
PDFDebugger. Yes, text extraction is fine.

I'm researching something else now. PDFBox has changed the API of 
{{showGlyph}}. It should have been backward compatible, but as you mentioned, 
{{AbstractPDF2XHTML.showGlyph}} isn't called, which is very suspicious.



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134100#comment-17134100
 ] 

Tim Allison commented on TIKA-3111:
---

Sorry, to clarify, we don’t get character counts for _any_ pages in that file 
now.



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134097#comment-17134097
 ] 

Tim Allison commented on TIKA-3111:
---

Not sure I follow.

Text extraction seems to be the same (on a quick look), and I recognize the 
file is broken. However, we used to get character counts for all of the pages, 
and now we don't. Oddly, this happens when I build on the command line but not 
in IntelliJ.

If this is expected, is there a way we can get the character counts and 
unmapped character counts?



[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133978#comment-17133978
 ] 

Tilman Hausherr commented on TIKA-3111:
---

tail of debug log for 2.0.19:
{quote}
Warning  [PDType1Font] Using fallback font TimesNewRomanPSMT for MIonic
 Error  [PDFStreamEngine] Operator Tm has too few operands: [COSInt{7}, 
COSInt{0}, COSInt{0}, COSInt{7}, COSFloat{1324270.4}]
 Warning  [BaseParser] Corrupt object reference at offset 1568, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1575, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1580, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1584, start offset: 
1561
 Error  [PDFStreamEngine] Operator Tm has too few operands: [COSInt{8}, 
COSInt{0}, COSInt{0}, COSInt{8}, COSFloat{132411.4}]
 Warning  [BaseParser] Corrupt object reference at offset 1687, start offset: 
1636
 Warning  [PDType1Font] Using fallback font TimesNewRomanPS-ItalicMT for 
MIonic-Italic
 Warning  [PDSimpleFont] No Unicode mapping for .notdef (13) in font MIonic
 Warning  [Type1Glyph2D] No glyph for code 13 (.notdef) in font MIonic
 Warning  [PDSimpleFont] No Unicode mapping for .notdef (10) in font MIonic
 Warning  [Type1Glyph2D] No glyph for code 10 (.notdef) in font MIonic
 Warning  [PDSimpleFont] No Unicode mapping for emlowln (108) in font T-1
{quote}

tail of debug log for 2.0.20:
{quote}
 Warning  [PDType1Font] Using fallback font TimesNewRomanPSMT for MIonic
 Error  [PDFStreamEngine] Operator Tm has too few operands: [COSInt{7}, 
COSInt{0}, COSInt{0}, COSInt{7}, COSFloat{1324270.4}]
 Warning  [BaseParser] Corrupt object reference at offset 1568, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1575, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1580, start offset: 
1561
 Warning  [BaseParser] Corrupt object reference at offset 1584, start offset: 
1561
 Error  [PDFStreamEngine] Operator Tm has too few operands: [COSInt{8}, 
COSInt{0}, COSInt{0}, COSInt{8}, COSFloat{132411.4}]
 Warning  [BaseParser] Corrupt object reference at offset 1687, start offset: 
1636
 Warning  [PDType1Font] Using fallback font TimesNewRomanPS-ItalicMT for 
MIonic-Italic
 Warning  [Type1Glyph2D] No glyph for code 13 (.notdef) in font MIonic
 Warning  [Type1Glyph2D] No glyph for code 10 (.notdef) in font MIonic
{quote}




[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133743#comment-17133743
 ] 

Hudson commented on TIKA-3111:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1821 (See 
[https://builds.apache.org/job/Tika-trunk/1821/])
TIKA-3111 -- upgrade to PDFBox 2.0.20 -- need to understand (tallison: 
[https://github.com/apache/tika/commit/81bbd8b307b776e61bbe997e8bf6bd1bd1cedb13])
* (edit) tika-parsers/pom.xml
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java




[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133677#comment-17133677
 ] 

Tim Allison commented on TIKA-3111:
---

I made the upgrade in master, but came across a weird failure: 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1502

The unit test passed in my IDE, but caused the build to fail on the command 
line ({{mvn clean install}}). It looks like showGlyph() is never being called 
when I try to build from the command line.
