tsdb extraction

2018-03-28 Thread Oleg Tikhonov
Hi guys,
I am wondering whether we have a parser that can handle time-series data, such
as from InfluxDB or Prometheus?

If you know of one that is still a work in progress, that would be good too.

Thanks in advance,
Oleg


[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418040#comment-16418040
 ] 

Hudson commented on TIKA-2618:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #12 (See 
[https://builds.apache.org/job/tika-branch-1x/12/])
TIKA-2618 -- avoid overwriting labels (tallison: 
[https://github.com/apache/tika/commit/ca9c2f53048e84a6c483165ba7779f8cb6393ec7])
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_labels-govdocs-515858.xls
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java


> LabelRecord and LabelSSTRecord text can be overwritten in xls
> -
>
> Key: TIKA-2618
> URL: https://issues.apache.org/jira/browse/TIKA-2618
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> In our regression tests, we've lost small amounts of text from quite a few 
> xls (standalone, but especially embedded).  This is somewhat caused by 
> removing the {{listenForAllRecords=true}} that I accidentally left in as part 
> of debugging something a while ago. When that is true, we don't cache the 
> records in currentSheet, so they are added to the {{extraTextCells}} list.  
> When that is false, which is now the default, the {{LabelRecord}} and 
> {{LabelSSTRecord}} are sometimes being overwritten because multiple cells can 
> have the same x/y coordinates in the {{currentSheet}} map.
> When {{listenForAllRecords=false}}, we're trying to listen for labels, but 
> we're often overwriting them because of the map.
> Let's add labels to {{extraTextCells}} so that at least the text is processed.
> As one example: "africa" in govdocs1/199/199294.ppt
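
The overwrite described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual ExcelExtractor code: the class name, the "column,row" string key, the method names, and the second label "asia" are all hypothetical; only the names currentSheet and extraTextCells and the label "africa" come from the issue text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical stand-in for the extractor's per-sheet state,
// not the actual ExcelExtractor code.
class LabelOverwriteSketch {

    // Cells keyed by "column,row"; a second put() with the same key replaces the first.
    final Map<String, String> currentSheet = new LinkedHashMap<>();

    // Text kept outside the grid, so nothing is lost when coordinates collide.
    final List<String> extraTextCells = new ArrayList<>();

    // The lossy behaviour described in the issue: labels go only into the map.
    void addLabelToSheet(int column, int row, String text) {
        currentSheet.put(column + "," + row, text);
    }

    // The behaviour the issue proposes: labels are appended to the list.
    void addLabelToExtraCells(String text) {
        extraTextCells.add(text);
    }

    public static void main(String[] args) {
        LabelOverwriteSketch sheet = new LabelOverwriteSketch();
        sheet.addLabelToSheet(0, 0, "africa");
        sheet.addLabelToSheet(0, 0, "asia"); // same coordinates: "africa" is replaced
        System.out.println(sheet.currentSheet.values());

        sheet.addLabelToExtraCells("africa");
        sheet.addLabelToExtraCells("asia"); // both labels survive
        System.out.println(sheet.extraTextCells);
    }
}
```

A list loses the cell position but keeps every label, which matches the stated goal that "at least the text is processed".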



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418034#comment-16418034
 ] 

Hudson commented on TIKA-2618:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1461 (See 
[https://builds.apache.org/job/Tika-trunk/1461/])
TIKA-2618 -- avoid overwriting labels (tallison: 
[https://github.com/apache/tika/commit/7a9b17f478c867c7df5516b4ebb2ce3bf8b0aa36])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_labels-govdocs-515858.xls




RE: 1.18 pre rc regression tests

2018-03-28 Thread Allison, Timothy B.
Still waiting for reports...

We've had quite a few files go from application/x-123 to image/x-tga via 
TIKA-2527.

I think this is expected because they all appear to be embedded files, with 
file names that end in .tga. But I wanted to confirm this is expected.

There's also one example of: application/x-stata-dta -> image/x-tga, which is 
probably wrong:

http://162.242.228.174/docs/commoncrawl2_likely_broken/BT/BTTVHEUDLE7WODDGPYT6LLA6LXMHS3CX.dta
 



-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, March 28, 2018 10:55 AM
To: dev@tika.apache.org
Subject: 1.18 pre rc regression tests

All,
I've run the initial regression tests.  The corpus size is now big enough that 
I have to migrate the H2 tables to postgres before writing the reports.  I'll 
post the reports as soon as they're finally ready, but I'm starting to go 
through some results now.

Cheers,

Tim



[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417998#comment-16417998
 ] 

Hudson commented on TIKA-2618:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #224 (See 
[https://builds.apache.org/job/tika-2.x-windows/224/])
TIKA-2618 -- avoid overwriting labels (tallison: rev 
7a9b17f478c867c7df5516b4ebb2ce3bf8b0aa36)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_labels-govdocs-515858.xls




[jira] [Resolved] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2618.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18



[jira] [Created] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2618:
-

 Summary: LabelRecord and LabelSSTRecord text can be overwritten in xls
 Key: TIKA-2618
 URL: https://issues.apache.org/jira/browse/TIKA-2618
 Project: Tika
 Issue Type: Task
 Reporter: Tim Allison


In our regression tests, we've lost small amounts of text from quite a few xls 
(standalone, but especially embedded).  This is somewhat caused by removing the 
{{listenForAllRecords=true}} that I accidentally left in as part of debugging 
something a while ago. When that is true, we don't cache the records in 
currentSheet, so they are added to the {{extraTextCells}} list.  When that is 
false, which is now the default, the {{LabelRecord}} and {{LabelSSTRecord}} are 
sometimes being overwritten because multiple cells can have the same x/y 
coordinates in the {{currentSheet}} map.

When {{listenForAllRecords=false}}, we're trying to listen for labels, but 
we're often overwriting them because of the map.

Let's add labels to {{extraTextCells}} so that at least the text is processed.

As one example: "africa" in govdocs1/199/199294.ppt





[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417900#comment-16417900
 ] 

Hudson commented on TIKA-2617:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See 
[https://builds.apache.org/job/tika-branch-1x/11/])
TIKA-2617 -- handle new IOOBE on streams now parsed as npoifs in ppt (tallison: 
[https://github.com/apache/tika/commit/c5cf55f5b7a64219ffc289f48957220d61b0ba86])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java


> Ignore NPOIFS IOOBE in PPT attachments
> --
>
> Key: TIKA-2617
> URL: https://issues.apache.org/jira/browse/TIKA-2617
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> TIKA-2588 has us trying to parse more embedded streams as npoifs.  Some of 
> these are throwing IOOBE in our regression set.  Rather than throw a runtime 
> exception while trying to parse an embedded stream, let's treat this like any 
> other embedded stream IOException.
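
A minimal sketch of the idea above, assuming a hypothetical callback interface in place of the real embedded-stream handling in HSLFExtractor (the interface, class, and method names here are made up for illustration): an IndexOutOfBoundsException raised while reading an embedded stream is rethrown as an IOException, so the caller can handle it the same way as any other embedded-stream failure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only; not the actual HSLFExtractor code.
class EmbeddedStreamSketch {

    // Hypothetical callback standing in for whatever parses the embedded stream.
    interface EmbeddedParser {
        void parse(InputStream stream) throws IOException;
    }

    // Rethrows an IndexOutOfBoundsException from a damaged stream as an IOException,
    // so callers need only one failure path for embedded streams.
    static void parseEmbedded(EmbeddedParser parser, InputStream stream) throws IOException {
        try {
            parser.parse(stream);
        } catch (IndexOutOfBoundsException e) {
            throw new IOException("Could not read embedded stream", e);
        }
    }

    public static void main(String[] args) {
        try {
            parseEmbedded(
                    s -> { throw new IndexOutOfBoundsException("bad offset"); },
                    new ByteArrayInputStream(new byte[0]));
        } catch (IOException e) {
            // The parent document can keep going; only the attachment is skipped.
            System.out.println("embedded stream skipped: " + e.getCause());
        }
    }
}
```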





[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417898#comment-16417898
 ] 

Hudson commented on TIKA-2614:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See 
[https://builds.apache.org/job/tika-branch-1x/11/])
TIKA-2614 -- treat simple body inline, not as an attachment (tallison: 
[https://github.com/apache/tika/commit/3ad2274753adadfe5685e1ca40e869f549ef56b5])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (add) 
tika-parsers/src/test/resources/test-documents/testRFC822_simple_inline_body.txt


> RFC822 treats non-multipart as attachment
> -
>
> Key: TIKA-2614
> URL: https://issues.apache.org/jira/browse/TIKA-2614
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Blocker
> Fix For: 1.18, 2.0.0
>
> Attachments: TIKA-2614-from-common-crawl.txt
>
>
> Found during regression testing in prep for 1.18, now that we're identifying 
> a lot more rfc822...for those that have no multipart, we need to treat the 
> body "inline" and not as an attachment.





[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417899#comment-16417899
 ] 

Hudson commented on TIKA-2616:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See 
[https://builds.apache.org/job/tika-branch-1x/11/])
TIKA-2616 -- preserve message/news (tallison: 
[https://github.com/apache/tika/commit/1cd565c1296e815b2f8f052556f9437920181428])
* (add) tika-parsers/src/test/resources/test-documents/testMessageNews.txt
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> message/news now incorrectly identified as rfc822
> -
>
> Key: TIKA-2616
> URL: https://issues.apache.org/jira/browse/TIKA-2616
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> Thanks to [~gagravarr] on the dev list for confirming, this is a regression.  
> Let's move the priority for message-id in rfc822 lower to preserve 
> {{message/news}}.
> e.g.: 
> http://162.242.228.174/docs/commoncrawl2/VG/VGXYD2ISNSDJAVMK6CK7DHB3KI6ZHB6L
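
The effect of lowering a magic priority can be sketched as follows, assuming the detector prefers the matching rule with the highest priority. The class names and the numeric priorities (60, 50, 40) are hypothetical and are not taken from tika-mimetypes.xml; the sketch only shows why lowering the rfc822 rule lets message/news win when both rules match the same file.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only; not Tika's detection code.
class MagicPrioritySketch {

    // One magic rule that matched the input, with its configured priority.
    static final class MagicMatch {
        final String mimeType;
        final int priority;

        MagicMatch(String mimeType, int priority) {
            this.mimeType = mimeType;
            this.priority = priority;
        }
    }

    // Picks the type of the highest-priority match, or a generic fallback if none matched.
    static String pick(List<MagicMatch> matches) {
        return matches.stream()
                .max(Comparator.comparingInt((MagicMatch m) -> m.priority))
                .map(m -> m.mimeType)
                .orElse("application/octet-stream");
    }

    public static void main(String[] args) {
        MagicMatch news = new MagicMatch("message/news", 50);

        // rfc822 rule outranks the news rule: the news file is misidentified.
        System.out.println(pick(Arrays.asList(new MagicMatch("message/rfc822", 60), news)));

        // rfc822 rule lowered below the news rule: message/news is preserved.
        System.out.println(pick(Arrays.asList(new MagicMatch("message/rfc822", 40), news)));
    }
}
```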





[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417822#comment-16417822
 ] 

Hudson commented on TIKA-2617:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #223 (See 
[https://builds.apache.org/job/tika-2.x-windows/223/])
TIKA-2617 -- handle new IOOBE on streams now parsed as npoifs in ppt (tallison: 
rev acb49581c9be20f0c880400403caa0ea1b30b508)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java




[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417770#comment-16417770
 ] 

Hudson commented on TIKA-2614:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See 
[https://builds.apache.org/job/Tika-trunk/1460/])
TIKA-2614 -- treat simple body inline, not as an attachment (tallison: 
[https://github.com/apache/tika/commit/0af75e0a07b6e0ab1d58e06ab103698b3ca233b6])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (add) 
tika-parsers/src/test/resources/test-documents/testRFC822_simple_inline_body.txt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java




[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417771#comment-16417771
 ] 

Hudson commented on TIKA-2616:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See 
[https://builds.apache.org/job/Tika-trunk/1460/])
TIKA-2616 -- preserve message/news (tallison: 
[https://github.com/apache/tika/commit/892e38d5aba4e9f1480bce73802a2c2616da1db1])
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (add) tika-parsers/src/test/resources/test-documents/testMessageNews.txt
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417772#comment-16417772
 ] 

Hudson commented on TIKA-2617:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See 
[https://builds.apache.org/job/Tika-trunk/1460/])
TIKA-2617 -- handle new IOOBE on streams now parsed as npoifs in ppt (tallison: 
[https://github.com/apache/tika/commit/acb49581c9be20f0c880400403caa0ea1b30b508])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java




[jira] [Resolved] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2579.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

> Update to PDFBox 2.0.9 when available
> -
>
> Key: TIKA-2579
> URL: https://issues.apache.org/jira/browse/TIKA-2579
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.17
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> Hey team
>  
> We got this report in elasticsearch ingest attachment project: 
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically when a font is not available PDFBox is throwing an exception like
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>   at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>   at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>   at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>   at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>   at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>   at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>   at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>   at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with 
> https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in 
> PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue 
> https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will 
> actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could 
> be useful.
>  





[jira] [Resolved] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2607.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

> Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
> -
>
> Key: TIKA-2607
> URL: https://issues.apache.org/jira/browse/TIKA-2607
> Project: Tika
> Issue Type: Sub-task
> Components: core, parser
> Reporter: Andreas Meier
> Priority: Major
> Fix For: 1.18, 2.0.0
>
>
> The jbig2-imageio (formerly levigo) is now ASL 2.0 compatible and Version 
> 3.0.0 of it has been released as subproject of pdfbox. See 
> https://pdfbox.apache.org/
> Therefore the old implementation and restriction
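> {code:xml}
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
>   <scope>test</scope>
> </dependency>
> {code}
> can be replaced with 
> {code:xml}
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}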
> See also TIKA-2232





[jira] [Resolved] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2617.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18



[jira] [Resolved] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2616.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18



[jira] [Resolved] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2614.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18



[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417708#comment-16417708
 ] 

Hudson commented on TIKA-2616:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #222 (See 
[https://builds.apache.org/job/tika-2.x-windows/222/])
TIKA-2616 -- preserve message/news (tallison: rev 
892e38d5aba4e9f1480bce73802a2c2616da1db1)
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (add) tika-parsers/src/test/resources/test-documents/testMessageNews.txt




[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417707#comment-16417707
 ] 

Hudson commented on TIKA-2614:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #222 (See 
[https://builds.apache.org/job/tika-2.x-windows/222/])
TIKA-2614 -- treat simple body inline, not as an attachment (tallison: rev 
0af75e0a07b6e0ab1d58e06ab103698b3ca233b6)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (add) 
tika-parsers/src/test/resources/test-documents/testRFC822_simple_inline_body.txt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java




[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt

2018-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417703#comment-16417703
 ] 

Tim Allison commented on TIKA-2569:
---

Whoa!  This added a huge amount of newly extracted text in our regression 
corpus.  Thank you, [~BAEApache]!

> Grouped Text boxes in .ppt
> --
>
> Key: TIKA-2569
> URL: https://issues.apache.org/jira/browse/TIKA-2569
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Richard A
> Assignee: Tim Allison
> Priority: Major
> Labels: easyfix
> Fix For: 1.18, 2.0.0
>
> Attachments: Presentation1.ppt, Presentation1.pptx
>
>
> Grouped Text boxes are unable to be parsed and no content is returned when 
> items have been grouped together. This issue does not seem to affect .pptx 
> files, only .ppt. The attached documents are the same except the file format. 
> It should give a very simple example of a .ppt document where no content will 
> be returned.





[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417688#comment-16417688
 ] 

Tim Allison commented on TIKA-2617:
---

e.g. govdocs1/206/206668.ppt and govdocs1/164/164761.ppt



[jira] [Created] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2617:
-

 Summary: Ignore NPOIFS IOOBE in PPT attachments
 Key: TIKA-2617
 URL: https://issues.apache.org/jira/browse/TIKA-2617
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


TIKA-2588 has us trying to parse more embedded streams as npoifs.  Some of 
these are throwing IOOBE in our regression set.  Rather than throw a runtime 
exception while trying to parse an embedded stream, let's treat this like any 
other embedded stream IOException.





[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417577#comment-16417577
 ] 

Hudson commented on TIKA-2579:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #221 (See 
[https://builds.apache.org/job/tika-2.x-windows/221/])
 TIKA-2579 and TIKA-2607: Upgrade PDFBox to 2.0.9 and include new (tallison: 
rev ee9e4f445dc8801fe69b5d7702c27aecbf9a6efd)
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
* (edit) tika-parsers/pom.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java


> Update to PDFBox 2.0.9 when available
> -
>
> Key: TIKA-2579
> URL: https://issues.apache.org/jira/browse/TIKA-2579
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.17
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Major
>
> Hey team
>  
> We got this report in elasticsearch ingest attachment project: 
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically when a font is not available PDFBox is throwing an exception like
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>  at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>  at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>  at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>  at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>  at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>  at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>  at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>  at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>  at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>  at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>  at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with 
> https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in 
> PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue 
> https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will 
> actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could 
> be useful.
>  





[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417578#comment-16417578
 ] 

Hudson commented on TIKA-2607:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #221 (See 
[https://builds.apache.org/job/tika-2.x-windows/221/])
 TIKA-2579 and TIKA-2607: Upgrade PDFBox to 2.0.9 and include new (tallison: 
rev ee9e4f445dc8801fe69b5d7702c27aecbf9a6efd)
* (edit) tika-parsers/pom.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
* (edit) CHANGES.txt


> Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
> -
>
> Key: TIKA-2607
> URL: https://issues.apache.org/jira/browse/TIKA-2607
> Project: Tika
> Issue Type: Sub-task
> Components: core, parser
> Reporter: Andreas Meier
> Priority: Major
>
> The jbig2-imageio (formerly levigo) is now ASL 2.0 compatible, and version 
> 3.0.0 has been released as a subproject of PDFBox. See 
> https://pdfbox.apache.org/
> Therefore the old implementation and restriction
> {code:xml}
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
>   <scope>test</scope>
> </dependency>
> {code}
> can be replaced with 
> {code:xml}
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}
> See also TIKA-2232





Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Chris Mattmann
+1

From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 28, 2018 at 8:01 AM
To: "dev@tika.apache.org" 
Subject: Re: message/news; charset=windows-1252 -> message/rfc822

On Wed, 28 Mar 2018, Allison, Timothy B. wrote:
> With the new mime patterns, we've gotten quite a few changes of 
> message/news being identified as message/rfc822.  An example is:
>
> http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

That looks like a regression to me, it's really news

> We should correct this, right?  Any recommendations?

I think it's the Message-ID header it's matching on. I'd suggest we bump 
the news magics up from 50 (same as rfc822) to 60, so the news ones take 
preference

Nick



[jira] [Resolved] (TIKA-2615) mbox incorrectly identified as RFC822

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2615.
---
Resolution: Duplicate

Oops...these are {{message/news}}...duplicate issue... I think.

> mbox incorrectly identified as RFC822
> -
>
> Key: TIKA-2615
> URL: https://issues.apache.org/jira/browse/TIKA-2615
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> With our new mime magic, there are quite a few files that used to be 
> identified as "text/plain", but they're now identified as RFC822.  In the 
> following cases, the correct categorization would be {{application/mbox}}...I 
> think?
> Examples:
> http://162.242.228.174/docs/govdocs1/132/132113.txt
> 'govdocs1/215/215590.txt'
> 'govdocs1/332/332894.txt'





[jira] [Created] (TIKA-2616) message/news identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2616:
-

Summary: message/news identified as rfc822
Key: TIKA-2616
URL: https://issues.apache.org/jira/browse/TIKA-2616
Project: Tika
Issue Type: Task
Reporter: Tim Allison


Thanks to [~gagravarr] on the dev list for confirming, this is a regression.  
Let's move the priority for message-id in rfc822 lower to preserve 
{{message/news}}.

e.g.: 
http://162.242.228.174/docs/commoncrawl2/VG/VGXYD2ISNSDJAVMK6CK7DHB3KI6ZHB6L






[jira] [Updated] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2616:
--
Summary: message/news now incorrectly identified as rfc822  (was: 
message/news identified as rfc822)

> message/news now incorrectly identified as rfc822
> -
>
> Key: TIKA-2616
> URL: https://issues.apache.org/jira/browse/TIKA-2616
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Thanks to [~gagravarr] on the dev list for confirming, this is a regression.  
> Let's move the priority for message-id in rfc822 lower to preserve 
> {{message/news}}.
> e.g.: 
> http://162.242.228.174/docs/commoncrawl2/VG/VGXYD2ISNSDJAVMK6CK7DHB3KI6ZHB6L





[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417510#comment-16417510
 ] 

Hudson commented on TIKA-2579:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1459 (See 
[https://builds.apache.org/job/Tika-trunk/1459/])
 TIKA-2579 and TIKA-2607: Upgrade PDFBox to 2.0.9 and include new (tallison: 
[https://github.com/apache/tika/commit/ee9e4f445dc8801fe69b5d7702c27aecbf9a6efd])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) CHANGES.txt
* (edit) tika-parsers/pom.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java


> Update to PDFBox 2.0.9 when available
> -
>
> Key: TIKA-2579
> URL: https://issues.apache.org/jira/browse/TIKA-2579
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.17
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Major
>
> Hey team
>  
> We got this report in elasticsearch ingest attachment project: 
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically when a font is not available PDFBox is throwing an exception like
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>  at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>  at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>  at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>  at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>  at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>  at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>  at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>  at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>  at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>  at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>  at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with 
> https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in 
> PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue 
> https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will 
> actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could 
> be useful.
>  





[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417511#comment-16417511
 ] 

Hudson commented on TIKA-2607:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1459 (See 
[https://builds.apache.org/job/Tika-trunk/1459/])
 TIKA-2579 and TIKA-2607: Upgrade PDFBox to 2.0.9 and include new (tallison: 
[https://github.com/apache/tika/commit/ee9e4f445dc8801fe69b5d7702c27aecbf9a6efd])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
* (edit) tika-parsers/pom.xml
* (edit) CHANGES.txt


> Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
> -
>
> Key: TIKA-2607
> URL: https://issues.apache.org/jira/browse/TIKA-2607
> Project: Tika
> Issue Type: Sub-task
> Components: core, parser
> Reporter: Andreas Meier
> Priority: Major
>
> The jbig2-imageio (formerly levigo) is now ASL 2.0 compatible, and version 
> 3.0.0 has been released as a subproject of PDFBox. See 
> https://pdfbox.apache.org/
> Therefore the old implementation and restriction
> {code:xml}
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
>   <scope>test</scope>
> </dependency>
> {code}
> can be replaced with 
> {code:xml}
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}
> See also TIKA-2232





[jira] [Created] (TIKA-2615) mbox incorrectly identified as RFC822

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2615:
-

Summary: mbox incorrectly identified as RFC822
Key: TIKA-2615
URL: https://issues.apache.org/jira/browse/TIKA-2615
Project: Tika
Issue Type: Task
Reporter: Tim Allison


With our new mime magic, there are quite a few files that used to be identified 
as "text/plain", but they're now identified as RFC822.  In the following cases, 
the correct categorization would be {{application/mbox}}...I think?

Examples:
http://162.242.228.174/docs/govdocs1/132/132113.txt

'govdocs1/215/215590.txt'
'govdocs1/332/332894.txt'





Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Nick Burch

On Wed, 28 Mar 2018, Allison, Timothy B. wrote:
> With the new mime patterns, we've gotten quite a few changes of 
> message/news being identified as message/rfc822.  An example is:
>
> http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

That looks like a regression to me, it's really news

> We should correct this, right?  Any recommendations?

I think it's the Message-ID header it's matching on. I'd suggest we bump 
the news magics up from 50 (same as rfc822) to 60, so the news ones take 
preference

Nick
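
To make the priority suggestion concrete, here is a rough sketch of how a higher priority lets one magic win over another when both match. It is a simplified model and not Tika's detector: the Magic class, the detect and rules helpers, the substring matching, the "first listed wins on a tie" rule and the text/plain fallback are all assumptions made for illustration. The "Newsgroups:" marker is likewise just an assumed stand-in for whatever the news magics actually match on; only the Message-ID match and the 50 and 60 values come from the message above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not Tika's mime detection code.
final class MagicPrioritySketch {

    // One magic rule: the type it implies, a marker to look for, a priority.
    static final class Magic {
        final String type;
        final String marker;
        final int priority;

        Magic(String type, String marker, int priority) {
            this.type = type;
            this.marker = marker;
            this.priority = priority;
        }

        boolean matches(String text) {
            return text.contains(marker);
        }
    }

    // Returns the type of the matching rule with the highest priority.
    // On equal priority the rule listed first wins (an assumption of this sketch).
    static String detect(List<Magic> magics, String text) {
        Magic best = null;
        for (Magic m : magics) {
            if (m.matches(text) && (best == null || m.priority > best.priority)) {
                best = m;
            }
        }
        return best == null ? "text/plain" : best.type;
    }

    // Builds the two rules, with the news priority as a parameter.
    static List<Magic> rules(int newsPriority) {
        List<Magic> rules = new ArrayList<>();
        rules.add(new Magic("message/rfc822", "Message-ID:", 50));
        rules.add(new Magic("message/news", "Newsgroups:", newsPriority));
        return rules;
    }

    public static void main(String[] args) {
        String article = "Newsgroups: comp.lang.java\nMessage-ID: <1@example.org>\n\nbody";
        System.out.println(detect(rules(50), article)); // message/rfc822
        System.out.println(detect(rules(60), article)); // message/news
    }
}
```

In this model an article carrying both headers resolves to message/rfc822 while the priorities are tied at 50, and to message/news once the news rule is raised to 60.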


1.18 pre rc regression tests

2018-03-28 Thread Allison, Timothy B.
All,
I've run the initial regression tests.  The corpus size is now big enough that 
I have to migrate the H2 tables to postgres before writing the reports.  I'll 
post the reports as soon as they're finally ready, but I'm starting to go 
through some results now.

Cheers,

Tim



message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Allison, Timothy B.
All,
  With the new mime patterns, we've gotten quite a few changes of message/news 
being identified as message/rfc822.  An example is:

http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

We should correct this, right?  Any recommendations?

   Best,

  Tim



Timothy B. Allison, Ph.D.
Principal Artificial Intelligence Engineer
T835/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)




[jira] [Updated] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2614:
--
Attachment: TIKA-2614-from-common-crawl.txt

> RFC822 treats non-multipart as attachment
> -
>
> Key: TIKA-2614
> URL: https://issues.apache.org/jira/browse/TIKA-2614
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Blocker
> Attachments: TIKA-2614-from-common-crawl.txt
>
>
> Found during regression testing in prep for 1.18, now that we're identifying 
> a lot more rfc822...for those that have no multipart, we need to treat the 
> body "inline" and not as an attachment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2614:
-

Summary: RFC822 treats non-multipart as attachment
Key: TIKA-2614
URL: https://issues.apache.org/jira/browse/TIKA-2614
Project: Tika
Issue Type: Task
Reporter: Tim Allison


Found during regression testing in prep for 1.18, now that we're identifying a 
lot more rfc822...for those that have no multipart, we need to treat the body 
"inline" and not as an attachment.
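
A rough sketch of the intended rule, for illustration only. The Handling enum and the classify method are hypothetical names, not part of the RFC822 parser. The only part taken from the description above is that the body of a message with no multipart is handled inline; the handling of parts inside a multipart message (attachment only when the Content-Disposition value starts with "attachment") is an assumption added to make the example complete.

```java
import java.util.Locale;

// Hypothetical sketch; not the actual RFC822 parser logic.
final class BodyHandlingSketch {

    enum Handling { INLINE, ATTACHMENT }

    // multipart: whether the body belongs to a multipart message.
    // contentDisposition: raw Content-Disposition value, or null if absent.
    static Handling classify(boolean multipart, String contentDisposition) {
        if (!multipart) {
            // A message without multipart has a single body: treat it inline.
            return Handling.INLINE;
        }
        // Assumed rule for parts of a multipart message.
        if (contentDisposition != null
                && contentDisposition.trim().toLowerCase(Locale.ROOT).startsWith("attachment")) {
            return Handling.ATTACHMENT;
        }
        return Handling.INLINE;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, null));
        System.out.println(classify(true, "attachment; filename=\"report.pdf\""));
    }
}
```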


