Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
Yes, there is a big reason: with tika-dl you don’t need an external server
running. And of course you can statically analyze the code (with the other
solution you would have to mix languages to do that), etc.

So yes, we should keep them both…
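
The trade-off between the two modes can be sketched roughly as below. This is an illustrative Python sketch, not Tika code: `RestRecogniser`, `InProcessRecogniser`, `fake_model`, the endpoint URL, and the JSON response shape are all hypothetical names and assumptions chosen for the example.

```python
import json
import urllib.request
from typing import Callable, List


class RestRecogniser:
    """REST + Docker mode: the model lives in a separate server process."""

    def __init__(self, endpoint: str):
        # Hypothetical URL of a running model service.
        self.endpoint = endpoint

    def recognise(self, image: bytes) -> List[str]:
        req = urllib.request.Request(self.endpoint, data=image, method="POST")
        with urllib.request.urlopen(req, timeout=10) as resp:
            # Assumed response shape: a JSON list of label strings.
            return json.load(resp)


class InProcessRecogniser:
    """tika-dl style mode: the model runs inside the calling process."""

    def __init__(self, model: Callable[[bytes], List[str]]):
        self.model = model

    def recognise(self, image: bytes) -> List[str]:
        return self.model(image)


def fake_model(image: bytes) -> List[str]:
    # Stand-in for a real network: labels an image by its size only.
    return ["large image"] if len(image) > 1024 else ["small image"]


# No server has to be running for the in-process variant.
local = InProcessRecogniser(fake_model)
print(local.recognise(b"\x89PNG"))
```

Both classes expose the same `recognise` method, so a caller could switch between them; only the REST variant depends on something listening at the endpoint.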

 

 

 

 



Re: image recognition...how do the parts play together?

2018-07-06 Thread Tim Allison
This is very helpful. Thank you! Is there any use in having the tika-dl
module if our more modern approach is REST + Docker? The upkeep in tika-dl
is nontrivial.



Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
Tim,

Thanks. There are multiple modes of integrating deep learning with Tika:

1. The original mode uses Thamme’s work on exposing TensorFlow over REST,
   with Docker, to provide a REST service that lets Tika run TensorFlow DL
   models. We initially did Inception v3, plus a model by Madhav Sharan that
   combines OpenCV with Inception v3 (and a new Docker image that installs
   OpenCV; it’s a pain), for image and video object recognition,
   respectively. See https://github.com/apache/tika/pull/208 and
   https://github.com/apache/tika/pull/168 and also the wiki.
2. Later, Thamme, Avtar Singh, and KranthiGV added DL4J support
   (https://github.com/apache/tika/pull/165), including Inception v3 and
   VGG16 (https://github.com/apache/tika/pull/182). This houses the model in
   the USC Data Science repo and uses it as an example of how to store and
   load models from Keras/Python into DL4J:
   https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example/data
3. Then Thejan added text captioning, a new Docker image, and a trained model:
   https://github.com/apache/tika/pull/180
4. Then Raunaq from UPenn added Inception v4 support via the
   Docker/TensorFlow route: https://github.com/apache/tika/pull/162
5. All this Docker work led Thejan and others to think we needed to refactor
   the Docker images. We did that in https://github.com/apache/tika/pull/208
   to make them cleaner and to depend on
   http://github.com/USCDataScience/tika-dockers/ and on the
   http://github.com/USCDataScience/img2text models for image captioning.
   Now video and image recognition and image captioning all share the same
   base Docker image, with sub-images derived from it.

That’s where we’re at today. Make sense? ☺ Thejan and others want to add more
DL4J-supported models, and we can always use TensorFlow/Docker as well.

Cheers,

Chris

 

 

 

 


 



[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2680:
--
Attachment: main_email_in_outlook.jpg

> Email attachments to an email are not extracted
> ---
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: main_email_in_outlook.jpg, nested.eml
>
>
> I have a number of email messages that contain other email messages as 
> attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are 
> not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level 
> attached email, 2nd level attached email), but I only get 1 email and 1 
> unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) ",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) ",
>         "Message-To": [
>             "fm.SAN Management Team ",
>             "Smith Van der, H (Henry) "
>         ],
>         "Message:From-Email": "henry.van.der.sm...@bank.com",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "325",
>         "creator": "Smith Van der, H (Henry) ",
>         "dc:creator": "Smith Van der, H (Henry) ",
>         "dc:title": "Test email within email",
>         "dcterms:created": "2018-04-25T12:46:41Z",
>         "meta:author": "Smith Van der, H (Henry) ",
>         "meta:creation-date": "2018-04-25T12:46:41Z",
>         "resourceName": "nested.eml",
>         "subject": "Test email within email"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/plain; charset=US-ASCII",
>         "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "5",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}
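
One quick way to check a report like this is to count the records in the `-m -J` output rather than read it by eye. The sketch below is a minimal example, assuming the JSON has been captured as a string; `TIKA_OUTPUT` is a trimmed copy of the two records above (most keys omitted) and `summarize` is a hypothetical helper, not part of Tika.

```python
import json

# Trimmed copy of the two metadata records from the report (most keys omitted).
TIKA_OUTPUT = """
[
  {"Content-Type": "message/rfc822",
   "resourceName": "nested.eml",
   "subject": "Test email within email"},
  {"Content-Type": "text/plain; charset=US-ASCII",
   "X-TIKA:embedded_resource_path": "/embedded-1",
   "embeddedResourceType": "ATTACHMENT"}
]
"""


def summarize(records):
    """Return (number of message/rfc822 records, number of embedded records)."""
    emails = [r for r in records
              if r.get("Content-Type", "").startswith("message/rfc822")]
    embedded = [r for r in records if "X-TIKA:embedded_resource_path" in r]
    return len(emails), len(embedded)


print(summarize(json.loads(TIKA_OUTPUT)))  # (1, 1)
```

For the attached `nested.eml` the reporter expects three `message/rfc822` records; the output above contains only one.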



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535412#comment-16535412
 ] 

Tim Allison commented on TIKA-2680:
---

Given that Outlook appears to treat this as an attachment, are you ok if we do 
the same? !main_email_in_outlook.jpg! 



[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2680:
--
Attachment: (was: main_email_in_outlook.jpg)



[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2680:
--
Attachment: main_email_in_outlook.jpg



image recognition...how do the parts play together?

2018-07-06 Thread Tim Allison
On Twitter[1], Chris, Thamme, Thejan, and I are working with some
deeplearning4j devs who are helping us upgrade to deeplearning4j 1.0.0-BETA
(TIKA-2672).

I initially requested help from Thejan (and Thamme :D) for this because we
were getting an initialization exception after the upgrade in tika-dl's
DL4JInceptionV3Net.

According to our wiki[2], we upgraded to InceptionV4 in TIKA-2306 by adding
the TensorFlowRESTRecogniser...does this mean we can get rid of
DL4JInceptionV3Net?  Or, what are we actually asking the dl4j folks to help
with?

How do these recognizers play together?

Thank you.

Cheers,

 Tim

[1] e.g.  https://twitter.com/chrismattmann/status/1015340483923439617
[2] https://wiki.apache.org/tika/TikaAndVision


[jira] [Comment Edited] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535351#comment-16535351
 ] 

Yury Kats edited comment on TIKA-2680 at 7/6/18 9:07 PM:
-

Indeed, the first embedded rfc822 is not an attachment. I believe this is 
because it's an Exchange journaled email, see the presence of 
X-MS-Journal-Report header at the very top. 
In this case, the original message is wrapped in another message that can 
provide additional headers, such as Bcc and expanded distribution lists.


was (Author: yurykats):
Indeed, the first embedded rfc822 is not an attachment. I believe this is 
because it's an Exchange journaled email, see the presence of 
X-MS-Journal-Report header at the very top. 
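
For illustration, here is a minimal Python sketch of how such a journal wrapper could be detected and unwrapped with the standard `email` package. It is not Tika's implementation: `unwrap_journal_report` is a hypothetical helper, and the only thing taken from the issue is the `X-MS-Journal-Report` header name. The sample messages are built in memory purely for the example.

```python
from email.message import EmailMessage


def unwrap_journal_report(msg):
    """Return the journaled original if msg is an Exchange journal report, else msg."""
    if "X-MS-Journal-Report" not in msg:
        return msg
    for part in msg.iter_parts():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part holds exactly one embedded message.
            return part.get_payload(0)
    return msg


# The original message that the journal report wraps.
original = EmailMessage()
original["Subject"] = "Test email within email"
original.set_content("original body")

# A message/rfc822 part with no Content-Disposition header.
wrapper = EmailMessage()
wrapper.set_content(original)

# The journal report: envelope text plus the wrapped original.
report = EmailMessage()
report["X-MS-Journal-Report"] = ""
report.set_content("journal report envelope text")
report.make_mixed()
report.attach(wrapper)

print(unwrap_journal_report(report)["Subject"])
```

A message without the header is returned unchanged, so the helper can be applied to every input.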



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535351#comment-16535351
 ] 

Yury Kats commented on TIKA-2680:
-

Indeed, the first embedded rfc822 is not an attachment. I believe this is 
because it's an Exchange journaled email, see the presence of 
X-MS-Journal-Report header at the very top. 



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535339#comment-16535339
 ] 

Tim Allison commented on TIKA-2680:
---

Something like this?

{noformat}
multipart/mixed (uses _728aa617-16cf-4d95-8bc2-9f1868397202_)
  text/plain (_728aa617-16cf-4d95-8bc2-9f1868397202_)
    (sender and some other headers, no real content "Message-Id: <0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>")
  message/rfc822 (_728aa617-16cf-4d95-8bc2-9f1868397202_) uses (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
    multipart/alternative (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/plain (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/html (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) ("cocacola Henry van der Smith")
    end _000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
    message/rfc822 (content-disposition: attachment) (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_004_8075737674787666767166806676697476787366657271727266777_)
      multipart/alternative (_004_8075737674787666767166806676697476787366657271727266777_) uses (_000_8075737674787666767166806676697476787366657271727266777_)
        text/plain (_000_8075737674787666767166806676697476787366657271727266777_)
        text/html (_000_8075737674787666767166806676697476787366657271727266777_) ("Cocacola test Henry van der Smith")
      end _000_8075737674787666767166806676697476787366657271727266777_
      message/rfc822 (content-disposition: attachment text/plain) (004_8075737674787666767166806676697476787366657271727266777)
        no multipart body, just plain text: ("I won't be able to attend")
  end _004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
end _728aa617-16cf-4d95-8bc2-9f1868397202_
{noformat}

As you point out, it is mildly odd (to me at least) that the first embedded 
rfc822 (the one that uses 
_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) does not have 
content-disposition: attachment.
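
For anyone who wants to reproduce an outline like the one above without Tika, here is a minimal sketch using Python's standard `email` package. The sample message, its `OUTER`/`INNER` boundaries, and the `outline` helper are invented for illustration; they are not taken from nested.eml, and this is not how Tika's RFC822Parser works internally.

```python
from email import message_from_string

# Hypothetical sample: an outer envelope wrapping a message that itself
# carries an attached message. Boundaries and subjects are made up.
SAMPLE = """\
MIME-Version: 1.0
Subject: outer
Content-Type: multipart/mixed; boundary="OUTER"

--OUTER
Content-Type: text/plain

journal envelope
--OUTER
Content-Type: message/rfc822

Subject: inner
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="INNER"

--INNER
Content-Type: text/plain

body
--INNER
Content-Type: message/rfc822
Content-Disposition: attachment

Subject: innermost
Content-Type: text/plain

I won't be able to attend
--INNER--
--OUTER--
"""


def outline(part, depth=0, lines=None):
    """Collect one indented line per MIME part: type, boundary, disposition."""
    if lines is None:
        lines = []
    label = part.get_content_type()
    boundary = part.get_boundary()
    if boundary:
        label += " (uses %s)" % boundary
    disposition = part.get_content_disposition()
    if disposition:
        label += " (content-disposition: %s)" % disposition
    lines.append("  " * depth + label)
    # In the stdlib parser, multipart/* and message/rfc822 parts both hold
    # a list of child Message objects, so one recursion covers both cases.
    if part.is_multipart():
        for child in part.get_payload():
            outline(child, depth + 1, lines)
    return lines


for line in outline(message_from_string(SAMPLE)):
    print(line)
```

In this sketch the attached message shows up as its own `message/rfc822` node whether or not it carries a Content-Disposition header, which is the behaviour the reporter is asking for.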

> Email attachments to an email are not extracted
> ---
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: nested.eml
>
>
> I have a number of email messages that contain other email messages as 
> attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are 
> not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level 
> attached email, 2nd level attached email), but I only get 1 email and 1 
> unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
> {
> "Author": "Smith Van der, H (Henry) ",
> "Content-Length": "16649",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2018-04-25T12:46:41Z",
> "Message-From": "Smith Van der, H (Henry) ",
> "Message-To": [
> "fm.SAN Management Team ",
> "Smith Van der, H (Henry) "
> ],
> "Message:From-Email": "henry.van.der.sm...@bank.com",
> "Message:From-Name": "Smith Van der, H (Henry)",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:Content-Transfer-Encoding": "binary",
> "Message:Raw-Header:Keywords": "",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": 
> "",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": 
> "",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": 
> "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> 

[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535296#comment-16535296
 ] 

Yury Kats commented on TIKA-2685:
-

Yes, correct, this is governed by RFC 3462; sorry I didn't mention this upfront.

> Email attached to an undeliverable email report are not extracted
> -
>
> Key: TIKA-2685
> URL: https://issues.apache.org/jira/browse/TIKA-2685
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: undeliverable.eml
>
>
> I have a number of email messages that are reports of undeliverable emails and 
> contain the original email message as an attachment.
> The original emails are parts with "Content-Type: message/rfc822" but are not 
> being recognized as such.
> Attached is an example email:
>  * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
>  ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
>  
> I would like to see 2 separate emails parsed out (top level undeliverable 
> report, 1st level attached original email), but I get 1 email and 2 unnamed 
> text attachments:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m json.tool
> [
>     {
>         "Author": "postmas...@bank.com",
>         "Content-Length": "17356",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2017-11-04T16:00:11Z",
>         "Message-From": "postmas...@bank.com",
>         "Message-To": "uatalert...@logscape.com",
>         "Message:From-Email": "postmas...@bank.com",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "326",
>         "creator": "postmas...@bank.com",
>         "dc:creator": "postmas...@bank.com",
>         "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
>         "dcterms:created": "2017-11-04T16:00:11Z",
>         "meta:author": "postmas...@bank.com",
>         "meta:creation-date": "2017-11-04T16:00:11Z",
>         "resourceName": "undeliverable.eml",
>         "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
>     },
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Type": "text/plain; charset=windows-1252",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "4",
>         "embeddedResourceType": "ATTACHMENT"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/html; charset=US-ASCII",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.html.HtmlParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-2",
>         "X-TIKA:parse_time_millis": "7",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}
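
The gap between expected and actual output in the report above can be checked mechanically. The following is a rough sketch that counts entries in the `-J` JSON output by media type. `SAMPLE_OUTPUT` is a hand-trimmed stand-in for the real output (only the `Content-Type` values are copied from the report), and `count_by_type` is a hypothetical helper, not part of Tika.

```python
import json
from collections import Counter

# Trimmed stand-in for the JSON list printed by `tika-app -m -J`;
# only the Content-Type values are taken from the report above.
SAMPLE_OUTPUT = """[
  {"Content-Type": "message/rfc822", "resourceName": "undeliverable.eml"},
  {"Content-Type": "text/plain; charset=windows-1252",
   "X-TIKA:embedded_resource_path": "/embedded-1"},
  {"Content-Type": "text/html; charset=US-ASCII",
   "X-TIKA:embedded_resource_path": "/embedded-2"}
]"""


def count_by_type(tika_json_text):
    """Count metadata entries per media type, ignoring parameters."""
    counts = Counter()
    for entry in json.loads(tika_json_text):
        value = entry.get("Content-Type", "unknown")
        # Assumption: multi-valued metadata may arrive as a list.
        if isinstance(value, list):
            value = value[0]
        counts[value.split(";")[0].strip()] += 1
    return counts


counts = count_by_type(SAMPLE_OUTPUT)
print(counts["message/rfc822"])  # prints 1; the reporter expects 2
```

Once the attached original message is extracted as its own document, the `message/rfc822` count for this file would be expected to rise to 2.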





[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535288#comment-16535288
 ] 

Tim Allison commented on TIKA-2685:
---

Page 2 of https://tools.ietf.org/html/rfc3462 describes exactly this...yay! Parts 
1) and 2) are both [Required], and 3), the original rfc822, is [Optional].
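
To make the three positions concrete, here is a small sketch with Python's standard `email` package that labels the top-level parts of a multipart/report by the role RFC 3462 assigns to each position. The sample report, its `R` and `A` boundaries, and the `describe_report` helper are hypothetical; the real undeliverable.eml additionally wraps the report in a journal envelope.

```python
from email import message_from_string

# Hypothetical delivery report; addresses and boundaries are invented.
REPORT = """\
MIME-Version: 1.0
Subject: Undeliverable: example
Content-Type: multipart/report; report-type=delivery-status; boundary="R"

--R
Content-Type: multipart/alternative; boundary="A"

--A
Content-Type: text/plain

Delivery has failed.
--A
Content-Type: text/html

<p>Delivery has failed.</p>
--A--
--R
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.example.com

Final-Recipient: rfc822; user@example.com
Action: failed
Status: 5.1.1

--R
Content-Type: message/rfc822

Subject: original
Content-Type: text/plain

original body
--R--
"""

# Roles by position, as listed on page 2 of RFC 3462.
ROLES = [
    "1) human-readable explanation [Required]",
    "2) machine-parsable status [Required]",
    "3) returned message or headers [Optional]",
]


def describe_report(report):
    """Pair each top-level part of a multipart/report with its role."""
    if report.get_content_type() != "multipart/report":
        raise ValueError("not a multipart/report")
    return [(role, part.get_content_type())
            for role, part in zip(ROLES, report.get_payload())]


for role, content_type in describe_report(message_from_string(REPORT)):
    print(role, "->", content_type)
```

Because the third part is optional, a consumer should not assume it is present; `zip` simply stops early when a report carries only the two required parts.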



[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271
 ] 

Yury Kats edited comment on TIKA-2685 at 7/6/18 8:03 PM:
-

delivery-status and message/rfc822 are inside multipart/report

{noformat}
multipart/report
  multipart/alternative
    text/plain
    text/html
  end
  message/delivery-status
  message/rfc822
end
{noformat}


was (Author: yurykats):
delivery-status and message/rfc822 are inside multipart/report

{noformat}
multipart/report
   multipart/alternative
  text/plain
  text/html
  end
  message/delivery-status
  message/rfc822
end
{noformat}



[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535275#comment-16535275
 ] 

Tim Allison commented on TIKA-2685:
---

I think I agree...the first rfc822 (multipart/report) has three parts:

1. multipart/alternative
2. message/delivery-status
3. message/rfc822 (the original eml)
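
As a rough illustration of pulling out part 3, the sketch below walks a parsed message with Python's standard `email` package, finds any multipart/report, and returns the message/rfc822 children as separate message objects. The sample and the `find_returned_messages` helper are assumptions made for this example and say nothing about how the fix should look inside Tika.

```python
from email import message_from_string

# Hypothetical report with the three parts listed above;
# boundaries and addresses are invented.
REPORT = """\
MIME-Version: 1.0
Subject: Undeliverable: example
Content-Type: multipart/report; report-type=delivery-status; boundary="R"

--R
Content-Type: text/plain

Delivery has failed.
--R
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.example.com

Final-Recipient: rfc822; user@example.com
Action: failed
Status: 5.1.1

--R
Content-Type: message/rfc822

Subject: original
Content-Type: text/plain

original body
--R--
"""


def find_returned_messages(msg):
    """Return the original messages attached to any multipart/report."""
    found = []
    # walk() also descends into message/rfc822 parts, so a report that is
    # itself wrapped in an outer envelope is still reached.
    for part in msg.walk():
        if part.get_content_type() != "multipart/report":
            continue
        for child in part.get_payload():
            if child.get_content_type() == "message/rfc822":
                # A message/rfc822 part holds exactly one nested message.
                found.append(child.get_payload(0))
    return found


for original in find_returned_messages(message_from_string(REPORT)):
    print(original["Subject"])
```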



[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271
 ] 

Yury Kats edited comment on TIKA-2685 at 7/6/18 8:02 PM:
-

delivery-status and message/rfc822 are inside multipart/report

{noformat}
multipart/report
   multipart/alternative
  text/plain
  text/html
  end
  message/delivery-status
  message/rfc822
end
{noformat}


was (Author: yurykats):
delivery-status and message/rfc822 are inside multipart/report



[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535267#comment-16535267
 ] 

Tim Allison edited comment on TIKA-2685 at 7/6/18 8:00 PM:
---

Is this your understanding of the structure?

{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (multipart/report::report-type=delivery-status) (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_) (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_) (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)
    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html
  (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}




was (Author: talli...@mitre.org):
Is this your understanding of the structure?

{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_) (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_) (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)
    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html
  (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}




[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271
 ] 

Yury Kats commented on TIKA-2685:
-

delivery-status and message/rfc822 are inside multipart/report



[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535267#comment-16535267
 ] 

Tim Allison commented on TIKA-2685:
---

Is this your understanding of the structure?

{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain/ (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)

    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html
    (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}



> Email attached to an undeliverable email report are not extracted
> -
>
> Key: TIKA-2685
> URL: https://issues.apache.org/jira/browse/TIKA-2685
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.18
>Reporter: Yury Kats
>Assignee: Tim Allison
>Priority: Major
> Attachments: undeliverable.eml
>
>
> I have a number of email messages that are reports of undeliverable emails that 
> contain the original email message as an attachment.
> The original emails are parts with "Content-Type: message/rfc822" but are not 
> being recognized as such.
> Attached is an example email:
>  * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
>  ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
>  
> I would like to see 2 separate emails parsed out (top level undeliverable 
> report, 1st level attached original email), but I get 1 email and 2 unnamed 
> text attachments:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J  /tmp/undeliverable.eml | python -m 
> json.tool
> [
> {
> "Author": "postmas...@bank.com",
> "Content-Length": "17356",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2017-11-04T16:00:11Z",
> "Message-From": "postmas...@bank.com",
> "Message-To": "uatalert...@logscape.com",
> "Message:From-Email": "postmas...@bank.com",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": 
> "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": 
> "",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal 
> Agent",
> "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": 
> "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:parse_time_millis": "326",
> "creator": "postmas...@bank.com",
> "dc:creator": "postmas...@bank.com",
> "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
> "dcterms:created": "2017-11-04T16:00:11Z",
> "meta:author": "postmas...@bank.com",
> "meta:creation-date": "2017-11-04T16:00:11Z",
> "resourceName": "undeliverable.eml",
> "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
> },
> {
> "Content-Encoding": "windows-1252",
> "Content-Type": "text/plain; charset=windows-1252",
> "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
> "Multipart-Subtype": "report",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "4",
> "embeddedResourceType": "ATTACHMENT"
> },
> {
> "Content-Encoding": "US-ASCII",
> "Content-Type": "text/html; charset=US-ASCII",
> "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
> "Multipart-Subtype": "report",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> 

[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535147#comment-16535147
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #56 (See 
[https://builds.apache.org/job/tika-branch-1x/56/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
[https://github.com/apache/tika/commit/525889a4f928d1d448c6aaf6b1ddc19081e07404])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java


> HtmlEncodingDetector doesn't follow the specification
> -
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 





[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535145#comment-16535145
 ] 

Hudson commented on TIKA-2673:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1517 (See 
[https://builds.apache.org/job/Tika-trunk/1517/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
[https://github.com/apache/tika/commit/790c1248207371e6cb2a3e7a1ec3a021503ec7a4])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java




[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535135#comment-16535135
 ] 

Yury Kats commented on TIKA-2685:
-

For my own immediate needs, I modified MimeStreamParser to call 
ContentHandler#startMessage with the stream, i.e., instead of
{code}
handler.startMessage()
{code}
I call
{code}
handler.startMessage(mimeTokenStream.getInputStream())
{code}
I then modified MailContentHandler to add a startMessage(InputStream is) 
method, in which I check whether the message is inside a "multipart/report" 
part and, if so, invoke handleEmbedded:
{code}
public void startMessage(InputStream is) throws MimeException {
    // A message that starts directly inside multipart/report is the attached original.
    boolean attachedMessage = parts.size() > 0
            && parts.peek().getMimeType().equals("multipart/report");
    if (!attachedMessage) {
        startMessage();
    } else {
        // Treat the attached message as an embedded message/rfc822 document.
        Metadata submd = new Metadata();
        submd.set(Metadata.CONTENT_TYPE, "message/rfc822");
        submd.set(Metadata.CONTENT_DISPOSITION, "attachment");
        try (TikaInputStream tis = TikaInputStream.get(is)) {
            handleEmbedded(tis, submd);
        } catch (IOException e) {
            throw new MimeException(e);
        }
    }
}
{code}

I've made a similar change for TIKA-2680, only there I detect that the message 
is an attachment in MimeEntity#advance, and then end up calling the new 
startMessage with the stream and then handleEmbedded.

Not sure if these are the best ways of solving these issues though. Looking 
forward to your take on them.
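
The decision described above (is the message that is about to start sitting 
directly inside a "multipart/report" part?) can be sketched on its own. This is 
only an illustration under assumptions, not the MailContentHandler code: the 
class name AttachedMessageCheck and the Deque<String> of MIME types are 
hypothetical stand-ins for the handler's internal stack of parts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, self-contained sketch; not part of Tika or mime4j.
class AttachedMessageCheck {

    // enclosingTypes models the stack of MIME types of the parts currently
    // open, with the innermost part on top (an assumption for illustration).
    static boolean isAttachedMessage(Deque<String> enclosingTypes) {
        // Constant-first equals avoids a NullPointerException on a null type.
        return !enclosingTypes.isEmpty()
                && "multipart/report".equals(enclosingTypes.peek());
    }

    public static void main(String[] args) {
        Deque<String> parts = new ArrayDeque<>();
        // Top-level message: nothing encloses it, so it is not an attachment.
        System.out.println(isAttachedMessage(parts));

        parts.push("multipart/mixed");
        parts.push("multipart/report");
        // A message starting inside multipart/report is the attached original.
        System.out.println(isAttachedMessage(parts));
    }
}
```

Only the innermost enclosing part is inspected, so a message nested deeper 
(for example inside a multipart/alternative within the report) would not match; 
whether that is the desired behaviour is an open question for the real fix.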


[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535130#comment-16535130
 ] 

Hudson commented on TIKA-2673:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #282 (See 
[https://builds.apache.org/job/tika-2.x-windows/282/])
TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: 
rev 790c1248207371e6cb2a3e7a1ec3a021503ec7a4)
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java




[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535100#comment-16535100
 ] 

Tim Allison commented on TIKA-2685:
---

[~yurykats], thank you for identifying this problem and TIKA-2680 and sharing 
them with us!  I have a couple of other things on my plate, but I _hope_ to 
turn to these fairly soon (next week or so?).  Your example files and diagnoses 
are very helpful.  Thank you!



[jira] [Assigned] (TIKA-2685) Email attached to an undeliverable email report are not extracted

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-2685:
-

Assignee: Tim Allison



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535041#comment-16535041
 ] 

Tim Allison commented on TIKA-2673:
---

I've added this to both 'master' and 'branch_1x'.  Let me know if you disagree 
with this or would like to make modifications.



Re: Tika 1.19?

2018-07-06 Thread Chris Mattmann
Once tika-dl works again with Inception v4, I’m good ☺

I’m working on adding some more models to tika-dl and other things
but those can come after 1.19.

Cheers,
Chris

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Friday, July 6, 2018 at 8:40 AM
To: "dev@tika.apache.org" 
Subject: Tika 1.19?

All,

  We've made quite a few improvements, what would you think of starting the
release process in a couple of weeks...say, July 23ish?

  I'd like to complete the dl4j upgrade and update some of our dependencies
so that we can at least build with Java 11.

  Any blockers or other things people want to get in?

   Cheers,

Tim



Tika 1.19?

2018-07-06 Thread Tim Allison
All,

  We've made quite a few improvements, what would you think of starting the
release process in a couple of weeks...say, July 23ish?

  I'd like to complete the dl4j upgrade and update some of our dependencies
so that we can at least build with Java 11.

  Any blockers or other things people want to get in?

   Cheers,

Tim


[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534994#comment-16534994
 ] 

Tim Allison commented on TIKA-2672:
---

Fantastic!  Thank you [~ThejanWijesinghe]! 

> Upgrade dl4j to 1.0.0-beta
> --
>
> Key: TIKA-2672
> URL: https://issues.apache.org/jira/browse/TIKA-2672
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: TIKA-2672.patch
>
>
> Let's try to upgrade dl4j.  I think I got us most of the way there, but I got 
> this error when reading the json config file.  Can someone with more 
> knowledge of layer specs help ([~thammegowda], perhaps :))?
> {noformat}
> org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid 
> configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) 
> for width dimension:  Invalid input configuration for kernel width. Require 0 
> < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0)
> Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides 
> = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution 
> mode = Truncate
> {noformat}
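
The constraint quoted in the exception can be checked by hand. The sketch below 
is an illustration under assumptions, not DL4J code: ConvShapeSketch and its 
methods are hypothetical names, and the output-size formula is the standard 
one for unpadded ("valid") convolutions, which is assumed here to correspond to 
DL4J's Truncate mode. The 299-pixel input size and the stride of 2 for the 
first layer are likewise assumptions taken from the usual Inception v3 layout.

```java
// Hypothetical, self-contained sketch; not part of DL4J or Tika.
class ConvShapeSketch {

    // The condition from the error message: 0 < k <= in + 2 * pad.
    static boolean kernelFits(int kernel, int in, int pad) {
        return kernel > 0 && kernel <= in + 2 * pad;
    }

    // Standard output size along one dimension, rounding down.
    static int outputSize(int in, int kernel, int pad, int stride) {
        return (in + 2 * pad - kernel) / stride + 1;
    }

    public static void main(String[] args) {
        // A 3x3 kernel with stride 2 on a 299-wide input gives 149,
        // which matches the h=149 reported in the error.
        System.out.println(outputSize(299, 3, 0, 2));

        // The failing layer sees inWidth=1, so a 3-wide kernel cannot fit.
        System.out.println(kernelFits(3, 1, 0));

        // With the expected width of 149 the same kernel fits.
        System.out.println(kernelFits(3, 149, 0));
    }
}
```

Since the height arrives as 149 but the width as 1, the kernel settings 
themselves look fine; the input shape reaching convolution2d_2 seems to have 
lost its width, which might point at how dimensions are ordered or inferred 
during import rather than at the layer parameters.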





[jira] [Comment Edited] (TIKA-2672) Upgrade dl4j to 1.0.0-beta

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534994#comment-16534994
 ] 

Tim Allison edited comment on TIKA-2672 at 7/6/18 3:30 PM:
---

Fantastic!  Thank you [~ThejanWijesinghe]! And, yes, I spent some time tweaking 
some of the parameters, but with no luck.  I think we need expert help. :)


was (Author: talli...@mitre.org):
Fantastic!  Thank you [~ThejanWijesinghe]! 



[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534992#comment-16534992
 ] 

Tim Allison commented on TIKA-2673:
---

[~gbouchar], thank you for contributing this!  I won't have time to run the 
regression tests any time soon.  Would you be ok if I added your 
StrictHtmlEncodingDetector to Tika now?  Users would then be able to configure 
Tika to use it via tika-config.xml.  If you're ok with this, is it ok if I add 
the Apache Software License 2.0 headers to your main class, test class and .tsv 
files?

 

Thank you, again!



[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534990#comment-16534990
 ] 

Hudson commented on TIKA-2675:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #55 (See 
[https://builds.apache.org/job/tika-branch-1x/55/])
TIKA-2675 OpenDocumentParser should fail on invalid zip files - throw (snagel: 
[https://github.com/apache/tika/commit/def58f6e84de61031212ae40714e02f05d3c19fc])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
* (add) tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt


> OpenDocumentParser should fail on invalid zip files
> ---
>
> Key: TIKA-2675
> URL: https://issues.apache.org/jira/browse/TIKA-2675
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.19, 2.0.0
>
>
> The OpenDocumentParser assumes a zip file as container. However, if it is 
> called on an invalid zip stream from a remote URL (see NUTCH-2603), the 
> parser signals success and returns a document with no/empty content. The 
> behavior is different when called on a local file: while the [constructor of 
> ZipFile|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipFile.html#ZipFile-java.io.File-]
>  fails on invalid input, the [constructor of 
> ZipInputStream|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipInputStream.html#ZipInputStream-java.io.InputStream-]
>  silently ignores the input.
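
The difference between the two constructors described above can be reproduced 
with the JDK alone. This is a minimal sketch, not the OpenDocumentParser fix: 
ZipValidationSketch and its method names are hypothetical, and the input is 
just a short text string standing in for a non-zip stream.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical, self-contained sketch; not part of Tika.
class ZipValidationSketch {

    // Returns true if opening the bytes as a ZipFile fails with ZipException.
    static boolean zipFileRejects(byte[] data) {
        File tmp = null;
        try {
            tmp = File.createTempFile("not-a-zip", ".odt");
            Files.write(tmp.toPath(), data);
            try (ZipFile zf = new ZipFile(tmp)) {
                return false; // opened without complaint
            }
        } catch (ZipException e) {
            return true; // invalid archive detected up front
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    // Returns true if ZipInputStream finds at least one entry in the bytes.
    static boolean streamHasEntry(byte[] data) {
        try (ZipInputStream zis =
                     new ZipInputStream(new ByteArrayInputStream(data))) {
            // No exception for non-zip input: getNextEntry() just returns null.
            return zis.getNextEntry() != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] notAZip = "This is plain text pretending to be an .odt file."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("ZipFile rejects input: " + zipFileRejects(notAZip));
        System.out.println("ZipInputStream found an entry: " + streamHasEntry(notAZip));
    }
}
```

One way a stream-based parser could detect this case is to treat "no entries 
were read" as an error instead of as an empty document; whether that matches 
the committed change is not something this sketch claims.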





[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta

2018-07-06 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534984#comment-16534984
 ] 

Chris A. Mattmann commented on TIKA-2672:
-

GREAT WORK [~ThejanWijesinghe] thanks my guy



[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta

2018-07-06 Thread Thejan Wijesinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534966#comment-16534966
 ] 

Thejan Wijesinghe commented on TIKA-2672:
-

[~talli...@apache.org] sorry for the delay. The dl4j update works fine for 
VGG, but not for Inception v3. I spent almost two days trying to fix this. 
First, I replaced the Keras 1 Inception v3 model with a Keras 2 Inception v3 
model, but couldn't get it working. I then assumed the problem was related to 
the dynamic input shapes (None shapes) of the model, so I changed the entire 
model to static shapes, starting by changing the input layer from 
(None, None, 3) to (299, 299, 3), but still no luck. I didn't see any breaking 
changes in their API either, so I filed an issue today: 
[https://github.com/deeplearning4j/deeplearning4j/issues/5831]. They have 
ignored the Inception v3 test in their own tests too. In any case, if they are 
aiming for full backward compatibility, the model import API should work with 
a Keras 1 model as well. Let's wait for their instructions.



[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files

2018-07-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534858#comment-16534858
 ] 

Hudson commented on TIKA-2675:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1516 (See 
[https://builds.apache.org/job/Tika-trunk/1516/])
TIKA-2675 -- OpenDocumentParser should fail on invalid zip via Sebastian 
(tallison: 
[https://github.com/apache/tika/commit/c9a81a400ee10e9342bbfe718d62f0b0d6c7944f])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
* (add) tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java


> OpenDocumentParser should fail on invalid zip files
> ---
>
> Key: TIKA-2675
> URL: https://issues.apache.org/jira/browse/TIKA-2675
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.19, 2.0.0
>
>
> The OpenDocumentParser assumes a zip file as container. However, if it is 
> called on an invalid zip stream from a remote URL (see NUTCH-2603), the 
> parser signals success and returns a document with no/empty content. The 
> behavior is different when called on a local file: while the [constructor of 
> ZipFile|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipFile.html#ZipFile-java.io.File-]
>  fails on invalid input, the [constructor of 
> ZipInputStream|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipInputStream.html#ZipInputStream-java.io.InputStream-]
>  silently ignores the input.
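
The difference described above can be reproduced with the JDK alone. The sketch below is not Tika code; the class name, method names and the sample bytes are chosen for illustration. It feeds the same non-zip bytes to both constructors.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

public class InvalidZipDemo {

    // Sample input that is clearly not a zip archive (illustrative only).
    public static final byte[] NOT_A_ZIP =
            "This is not a zip file!".getBytes(StandardCharsets.US_ASCII);

    /** True if ZipInputStream reports "no entries" instead of throwing. */
    public static boolean streamSilentlyEmpty(byte[] data) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
            return zis.getNextEntry() == null;
        }
    }

    /** True if opening the same bytes through ZipFile throws ZipException. */
    public static boolean fileRejected(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("not-a-zip", ".odt");
        try {
            Files.write(tmp, data);
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                return false; // opened without complaint
            } catch (ZipException e) {
                return true;  // constructor refused the invalid input
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ZipInputStream silently empty: " + streamSilentlyEmpty(NOT_A_ZIP));
        System.out.println("ZipFile rejected input: " + fileRejected(NOT_A_ZIP));
    }
}
```

A caller that only loops over getNextEntry() therefore cannot tell an invalid stream from an archive with nothing in it unless it checks the first result explicitly.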





[jira] [Resolved] (TIKA-2675) OpenDocumentParser should fail on invalid zip files

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2675.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.19

Thank you [~wastl-nagel]!



[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files

2018-07-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534789#comment-16534789
 ] 

ASF GitHub Bot commented on TIKA-2675:
--

tballison closed pull request #240: TIKA-2675 OpenDocumentParser should fail on 
invalid zip files
URL: https://github.com/apache/tika/pull/240
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
index e7eb76ab8..7ed77fbc1 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
@@ -26,6 +26,7 @@
 import java.util.HashSet;
 import java.util.Set;
 import java.util.zip.ZipEntry;
+import java.util.zip.ZipException;
 import java.util.zip.ZipFile;
 import java.util.zip.ZipInputStream;
 
@@ -175,10 +176,13 @@ public void parse(
 
     private void handleZipStream(ZipInputStream zipStream, Metadata metadata, ParseContext context, EndDocumentShieldingContentHandler handler) throws IOException, TikaException, SAXException {
         ZipEntry entry = zipStream.getNextEntry();
-        while (entry != null) {
+        if (entry == null) {
+            throw new IOException("No entries found in ZipInputStream");
+        }
+        do {
             handleZipEntry(entry, zipStream, metadata, context, handler);
             entry = zipStream.getNextEntry();
-        }
+        } while (entry != null);
     }
 
     private void handleZipFile(ZipFile zipFile, Metadata metadata,
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
index d1ec46e94..ba6c0ca11 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
@@ -19,6 +19,7 @@
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertTrue;
 
+import java.io.IOException;
 import java.io.InputStream;
 import java.util.List;
 
@@ -406,6 +407,28 @@ public void testEmbedded() throws Exception {
         assertEquals(3, metadataList.size());
     }
 
+    @Test(expected = IOException.class)
+    public void testInvalidFromStream() throws Exception {
+        try (InputStream is = this.getClass().getResource(
+                "/test-documents/testODTnotaZipFile.odt").openStream()) {
+            OpenDocumentParser parser = new OpenDocumentParser();
+            Metadata metadata = new Metadata();
+            ContentHandler handler = new BodyContentHandler();
+            parser.parse(is, handler, metadata, new ParseContext());
+        }
+    }
+
+    @Test(expected = IOException.class)
+    public void testInvalidFromFile() throws Exception {
+        try (TikaInputStream tis = TikaInputStream.get(this.getClass().getResource(
+                "/test-documents/testODTnotaZipFile.odt"))) {
+            OpenDocumentParser parser = new OpenDocumentParser();
+            Metadata metadata = new Metadata();
+            ContentHandler handler = new BodyContentHandler();
+            parser.parse(tis, handler, metadata, new ParseContext());
+        }
+    }
+
     private ParseContext getNonRecursingParseContext() {
         ParseContext parseContext = new ParseContext();
         parseContext.set(Parser.class, new EmptyParser());
diff --git a/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt b/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt
new file mode 100644
index 0..9c1d376f0
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt
@@ -0,0 +1 @@
+This is not a zip file!
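
The guard introduced in handleZipStream can be tried outside Tika with a small standalone sketch. Everything here apart from the "first entry must exist, then do/while" shape is hypothetical: the class, the helper that builds an in-memory zip, and the choice to collect entry names instead of parsing them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipStreamGuard {

    /** Lists entry names, failing fast when the stream yields no first entry. */
    public static List<String> entryNames(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new IOException("No entries found in ZipInputStream");
            }
            do {
                names.add(entry.getName());
                entry = zis.getNextEntry();
            } while (entry != null);
        }
        return names;
    }

    /** Builds a one-entry zip in memory (demo helper only). */
    public static byte[] oneEntryZip(String name, String content) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryNames(oneEntryZip("content.xml", "<doc/>")));
        try {
            entryNames("This is not a zip file!".getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One consequence of this shape is that a structurally valid archive containing no entries is rejected as well, since ZipInputStream reports it the same way as garbage input.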


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-874) Identify FITS (Flexible Image Transport System) files

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534784#comment-16534784
 ] 

Tim Allison commented on TIKA-874:
--

See TIKA-2684 for how to configure GDAL to parse FITS...many thanks to 
[~sborda]!

> Identify FITS (Flexible Image Transport System) files
> -
>
> Key: TIKA-874
> URL: https://issues.apache.org/jira/browse/TIKA-874
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Peter May
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.2
>
> Attachments: fits_support.patch
>
>
> Tika does not have a defined signature for application/fits files.  I have 
> created a patch (based on file(1) magic) to address identification of such 
> files, including a simple unit test.
> This patch only handles identification, not parsing of FITS files.





[jira] [Resolved] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2684.
---
Resolution: Not A Problem

Not a Tika problem technically, but definitely an area for us to improve our 
documentation.  Thank you [~sborda]!

> Tika does not extract *.fits header text, just file level metadata
> --
>
> Key: TIKA-2684
> URL: https://issues.apache.org/jira/browse/TIKA-2684
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, mime, parser
>Affects Versions: 1.18
>Reporter: Susan
>Priority: Minor
>
> Tika only pulls file-level metadata for *.fits (Flexible Image Transport 
> System) files, using:
> java -jar tika-app-1.18.jar --gui
> Content-Length: 699840
> Content-Type: application/fits
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.gdal.GDALParser
> X-TIKA:digest:MD5: d93e8f4654902c45c7f3e4f4bf5f63e2
> X-TIKA:digest:SHA256: da7c0f1b6643850856cba100e9b3e8db76b80e91583eb088635c416a2b4161b3
> resourceName: WFPC2u5780205r_c0fx.fits
> Rather than text from the header (extracted with astropy.py):
> SIMPLE  = T / file does conform to FITS standard
> BITPIX  = -32 / number of bits per data pixel
> NAXIS   = 3 / number of data axes
> NAXIS1  = 200 / length of data axis 1
> NAXIS2  = 200 / length of data axis 2
> NAXIS3  = 4 / length of data axis 3
> EXTEND  = T / FITS dataset may contain extensions
> COMMENT   FITS (Flexible Image Transport System) format is defined in 'Astronomy
> COMMENT   and Astrophysics', volume 376, page 359; bibcode: 2001A
> BSCALE  = 1.0E0 / REAL = TAPE*BSCALE + BZERO
> BZERO   = 0.0E0 /
> OPSIZE  = 2112 / PSIZE of original image
> ORIGIN  = 'STScI-STSDAS' / Fitsio version 21-Feb-1996
> FITSDATE= '2004-01-09' / Date FITS file was created
> FILENAME= 'u5780205r_cvt.c0h' / Original filename
> ALLG-MAX= 3.01E3 / Data max in all groups
> ALLG-MIN= -7.319537E1 / Data min in all groups
> ODATTYPE= 'FLOATING' / Original datatype: Single precision real
> SDASMGNU= 4 / Number of groups in original image
>
> This capability was mentioned in TIKA-874. I'm looking at netCDF 
> files/headers as a model for this behaviour.
> Thank you!
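
For readers unfamiliar with the header text quoted above: a FITS header is a sequence of 80-character "cards", with the keyword in columns 1-8, "= " in columns 9-10 when a value follows, an optional comment after a "/", and an END card closing the header. The sketch below parses that layout from a string. It is not how Tika or GDAL read FITS files, the class and method names are hypothetical, and it ignores details such as continued strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FitsCardSketch {

    static final int CARD_LENGTH = 80;

    /** Parses concatenated 80-character cards into keyword -> raw value text. */
    public static Map<String, String> parseHeader(String header) {
        Map<String, String> values = new LinkedHashMap<>();
        for (int off = 0; off + CARD_LENGTH <= header.length(); off += CARD_LENGTH) {
            String card = header.substring(off, off + CARD_LENGTH);
            String keyword = card.substring(0, 8).trim();
            if (keyword.equals("END")) {
                break; // END closes the header
            }
            if (!card.startsWith("= ", 8)) {
                continue; // COMMENT, HISTORY and blank cards carry no value
            }
            values.put(keyword, stripComment(card.substring(10)).trim());
        }
        return values;
    }

    /** Cuts the text at the first '/' that is outside a quoted string. */
    static String stripComment(String s) {
        boolean inQuotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (c == '/' && !inQuotes) {
                return s.substring(0, i);
            }
        }
        return s;
    }

    /** Pads a short line to a full 80-character card (demo helper only). */
    public static String card(String text) {
        return String.format("%-80s", text);
    }

    public static void main(String[] args) {
        String header =
                card("SIMPLE  =                    T / file does conform to FITS standard")
                + card("BITPIX  =                  -32 / number of bits per data pixel")
                + card("ORIGIN  = 'STScI-STSDAS'       / Fitsio version 21-Feb-1996")
                + card("COMMENT   FITS (Flexible Image Transport System) format")
                + card("END");
        System.out.println(parseHeader(header));
    }
}
```

In a real file the cards are stored back to back in 2880-byte blocks, so the same loop would run over the decoded bytes of the first block(s) rather than over a hand-built string.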





[jira] [Commented] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534780#comment-16534780
 ] 

Tim Allison commented on TIKA-2684:
---

W00t! Thank you [~chrismattmann].

 

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]  Go 
Blue(?)!

 

Need testers in the future?! Ha, we always need testers... in the past, present 
and future!



[jira] [Comment Edited] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534780#comment-16534780
 ] 

Tim Allison edited comment on TIKA-2684 at 7/6/18 12:46 PM:


W00t! Thank you [~chrismattmann].

 

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]  Go Blue!

 

Need testers in the future?! Ha, we always need testers... in the past, present 
and future!


was (Author: talli...@mitre.org):
W00t! Thank you [~chrismattmann].

 

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]  Go 
Blue(?)!

 

Need testers in the future?! Ha, we always need testers... in the past, present 
and future!
