Re: image recognition...how do the parts play together?
Yes, there is a big reason. It’s because you don’t have to have an external server running to use it with tika-dl. And of course you can statically analyze the code (with the other solution you have to mix languages to do that), etc. So yes, we should keep them both…

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Friday, July 6, 2018 at 4:30 PM
To: "dev@tika.apache.org"
Subject: Re: image recognition...how do the parts play together?

This is very helpful. Thank you! Is there any use in having the tika-dl module if our more modern approach is REST + Docker? The upkeep in tika-dl is nontrivial.
Re: image recognition...how do the parts play together?
This is very helpful. Thank you! Is there any use in having the tika-dl module if our more modern approach is REST + Docker? The upkeep in tika-dl is nontrivial.

On Fri, Jul 6, 2018 at 6:15 PM Chris Mattmann wrote:

> Tim,
>
> Thanks. There are multiple modes of integrating deep learning with Tika:
>
> The original mode: uses Thamme’s work on REST exposing Tensorflow and Docker to provide a REST Service to Tika to allow for running Tensorflow DL models. We initially did Inception_v3, and a model by Madhav Sharan that combines OpenCV with Inception v3 (and a new docker that installs OpenCV it’s a pain) for image and video object recognition, respectively. See: https://github.com/apache/tika/pull/208 and https://github.com/apache/tika/pull/168 and also the wiki
>
> Later, Thamme, Avtar Singh, KranthiGV, added DL4J support: https://github.com/apache/tika/pull/165 including Inceptionv3 and VGG16 - https://github.com/apache/tika/pull/182 This houses the model in USC Data science repo and uses it as an example for how to store and load models from Keras/Python into DL4j: https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example/data
>
> Then, Thejan added Text Captioning and a new Docker, and trained model: https://github.com/apache/tika/pull/180
>
> Then Raunaq from UPenn added Inception v4 support via the Docker/Tensorflow way: https://github.com/apache/tika/pull/162
>
> All this Docker work caused Thejan and others to think we needed to refactor the dockers. We did that here: https://github.com/apache/tika/pull/208 to make them cleaner, and to depend on: http://github.com/USCDataScience/tika-dockers/ and on http://github.com/USCDataScience/img2text models for image captioning. Now, Video and Image recognition and Image Captioning all had the same base docker and sub dockers from that.
>
> That’s where we’re at today. Make sense? ☺ Thejan and others want to add more DL4J supported models and we can always use Tensorflow/Docker as well as a way of doing it.
>
> Cheers,
>
> Chris
Re: image recognition...how do the parts play together?
Tim,

Thanks. There are multiple modes of integrating deep learning with Tika:

The original mode uses Thamme’s work on exposing Tensorflow over REST, with Docker, to provide a REST service to Tika for running Tensorflow DL models. We initially did Inception_v3, and a model by Madhav Sharan that combines OpenCV with Inception v3 (and a new docker that installs OpenCV, which is a pain), for image and video object recognition, respectively. See https://github.com/apache/tika/pull/208 and https://github.com/apache/tika/pull/168, and also the wiki.

Later, Thamme, Avtar Singh, and KranthiGV added DL4J support: https://github.com/apache/tika/pull/165, including Inceptionv3 and VGG16: https://github.com/apache/tika/pull/182. This houses the model in the USC Data Science repo and uses it as an example of how to store and load models from Keras/Python into DL4J: https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example/data

Then Thejan added Text Captioning, with a new Docker and a trained model: https://github.com/apache/tika/pull/180

Then Raunaq from UPenn added Inception v4 support via the Docker/Tensorflow way: https://github.com/apache/tika/pull/162

All this Docker work caused Thejan and others to think we needed to refactor the dockers. We did that here: https://github.com/apache/tika/pull/208, to make them cleaner and to depend on http://github.com/USCDataScience/tika-dockers/ and on the http://github.com/USCDataScience/img2text models for image captioning. Now Video and Image recognition and Image Captioning all share the same base docker, with sub dockers derived from it.

That’s where we’re at today. Make sense? ☺ Thejan and others want to add more DL4J supported models, and we can always use Tensorflow/Docker as well as a way of doing it.

Cheers,
Chris

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Friday, July 6, 2018 at 2:39 PM
To: "dev@tika.apache.org"
Subject: image recognition...how do the parts play together?

On Twitter, Chris, Thamme, Thejan, and I are working with some deeplearning4j devs to help us upgrade to deeplearning4j 1.0.0-BETA (TIKA-2672).

I initially requested help from Thejan (and Thamme :D) for this because we were getting an initialization exception after the upgrade in tika-dl's DL4JInceptionV3Net.

According to our wiki[2], we upgraded to InceptionV4 in Tika-2306 by adding the TensorFlowRESTRecogniser...does this mean we can get rid of DL4JInceptionV3Net? Or, what are we actually asking the dl4j folks to help with?

How do these recognizers play together?

Thank you.

Cheers,
Tim

[1] e.g. https://twitter.com/chrismattmann/status/1015340483923439617
[2] https://wiki.apache.org/tika/TikaAndVision
[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2680:
--
Attachment: main_email_in_outlook.jpg

> Email attachments to an email are not extracted
> ---
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: main_email_in_outlook.jpg, nested.eml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
> * Subject: Test email within email
> ** Subject: Email within email test
> *** Subject: Stand-up today
>
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) ",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) ",
>         "Message-To": [
>             "fm.SAN Management Team ",
>             "Smith Van der, H (Henry) "
>         ],
>         "Message:From-Email": "henry.van.der.sm...@bank.com",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "325",
>         "creator": "Smith Van der, H (Henry) ",
>         "dc:creator": "Smith Van der, H (Henry) ",
>         "dc:title": "Test email within email",
>         "dcterms:created": "2018-04-25T12:46:41Z",
>         "meta:author": "Smith Van der, H (Henry) ",
>         "meta:creation-date": "2018-04-25T12:46:41Z",
>         "resourceName": "nested.eml",
>         "subject": "Test email within email"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/plain; charset=US-ASCII",
>         "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "5",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
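For reference, the result the report asks for (one entry per nesting level) can be illustrated outside Tika. The following is only a minimal sketch using Python's standard `email` package, not Tika's RFC822Parser; the `SAMPLE` message is a hypothetical stand-in for nested.eml (its boundaries and body text are invented), and `nested_subjects` is an illustrative helper name.

```python
from email import message_from_string

# Hypothetical three-level message; a stand-in for nested.eml, not the real file.
SAMPLE = """\
Subject: Test email within email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: text/plain

envelope text
--outer
Content-Type: message/rfc822

Subject: Email within email test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="inner"

--inner
Content-Type: text/plain

first-level body
--inner
Content-Type: message/rfc822
Content-Disposition: attachment

Subject: Stand-up today
Content-Type: text/plain

I won't be able to attend
--inner--
--outer--
"""


def nested_subjects(msg, depth=0):
    """Return (depth, subject) for msg and each message/rfc822 part below it."""
    found = [(depth, str(msg["Subject"]))]
    pending = list(msg.get_payload()) if msg.is_multipart() else []
    while pending:
        part = pending.pop(0)
        if part.get_content_type() == "message/rfc822" and part.is_multipart():
            # The embedded message is the single payload item of the part.
            found.extend(nested_subjects(part.get_payload(0), depth + 1))
        elif part.is_multipart():
            # Descend into multipart/* containers, keeping document order.
            pending = list(part.get_payload()) + pending
    return found


if __name__ == "__main__":
    for depth, subject in nested_subjects(message_from_string(SAMPLE)):
        print("*" * (depth + 1), "Subject:", subject)
```

On the sample, this lists the three subjects in the same starred layout as the report, whether or not a given message/rfc822 part carries a Content-Disposition header.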
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535412#comment-16535412 ]

Tim Allison commented on TIKA-2680:
---
Given that Outlook appears to treat this as an attachment, are you ok if we do the same?

!main_email_in_outlook.jpg!
[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2680:
--
Attachment: (was: main_email_in_outlook.jpg)
[jira] [Updated] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2680:
--
Attachment: main_email_in_outlook.jpg
image recognition...how do the parts play together?
On Twitter, Chris, Thamme, Thejan, and I are working with some deeplearning4j devs to help us upgrade to deeplearning4j 1.0.0-BETA (TIKA-2672).

I initially requested help from Thejan (and Thamme :D) for this because we were getting an initialization exception after the upgrade in tika-dl's DL4JInceptionV3Net.

According to our wiki[2], we upgraded to InceptionV4 in Tika-2306 by adding the TensorFlowRESTRecogniser...does this mean we can get rid of DL4JInceptionV3Net? Or, what are we actually asking the dl4j folks to help with?

How do these recognizers play together?

Thank you.

Cheers,
Tim

[1] e.g. https://twitter.com/chrismattmann/status/1015340483923439617
[2] https://wiki.apache.org/tika/TikaAndVision
[jira] [Comment Edited] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535351#comment-16535351 ]

Yury Kats edited comment on TIKA-2680 at 7/6/18 9:07 PM:
-
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email, see the presence of X-MS-Journal-Report header at the very top. In this case, the original message is wrapped in another message that can provide additional headers, such as Bcc and expanded distribution lists.

was (Author: yurykats):
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email, see the presence of X-MS-Journal-Report header at the very top.
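The journaling explanation in this comment can be sketched in code. This is a rough illustration with Python's standard `email` package, not Tika's implementation. It assumes that the mere presence of the X-MS-Journal-Report header is enough to identify the wrapper and that the first message/rfc822 part is the journaled original. The `JOURNALED` sample and the helper name `unwrap_journal` are hypothetical.

```python
from email import message_from_string

# Hypothetical journal-report wrapper; all values are invented for illustration.
JOURNALED = """\
Subject: Test email within email
X-MS-Journal-Report:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="env"

--env
Content-Type: text/plain

Sender: someone@example.com
--env
Content-Type: message/rfc822

Subject: Email within email test
Content-Type: text/plain

original body
--env--
"""


def unwrap_journal(msg):
    """Return the wrapped original if msg looks like a journal report, else msg."""
    # Assumption: header presence alone marks the wrapper; its value may be empty.
    if "X-MS-Journal-Report" not in msg or not msg.is_multipart():
        return msg
    for part in msg.get_payload():
        if part.get_content_type() == "message/rfc822" and part.is_multipart():
            # The embedded message is the single payload item of the part.
            return part.get_payload(0)
    return msg


if __name__ == "__main__":
    original = unwrap_journal(message_from_string(JOURNALED))
    print(original["Subject"])
```

A message without the header is returned unchanged, so a caller could apply the helper to every top-level message before deciding which parts count as attachments.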
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535351#comment-16535351 ]

Yury Kats commented on TIKA-2680:
-
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email, see the presence of X-MS-Journal-Report header at the very top.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535339#comment-16535339 ]

Tim Allison commented on TIKA-2680:
---
Something like this?

{noformat}
multipart/mixed (uses _728aa617-16cf-4d95-8bc2-9f1868397202_)
  text/plain (_728aa617-16cf-4d95-8bc2-9f1868397202_) (sender and some other headers, no real content "Message-Id: <0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>")
  message/rfc822 (_728aa617-16cf-4d95-8bc2-9f1868397202_) uses (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
    multipart/alternative (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/plain (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/html (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) ("cocacola Henry van der Smith")
    end _000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
    message/rfc822 (content-disposition: attachment) (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_004_8075737674787666767166806676697476787366657271727266777_)
      multipart/alternative (_004_8075737674787666767166806676697476787366657271727266777_) uses (_000_8075737674787666767166806676697476787366657271727266777_)
        text/plain (_000_8075737674787666767166806676697476787366657271727266777_)
        text/html (_000_8075737674787666767166806676697476787366657271727266777_) ("Cocacola test Henry van der Smith")
      end _000_8075737674787666767166806676697476787366657271727266777_
      message/rfc822 (content-disposition: attachment text/plain) (004_8075737674787666767166806676697476787366657271727266777) no multipart body, just plain text: ("I won't be able to attend")
    end _004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
end _728aa617-16cf-4d95-8bc2-9f1868397202_
{noformat}

As you point out...it is mildly odd (to me at least) that the first embedded rfc822 (the one that uses
_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom) does not have content-disposition: attachment.

> Email attachments to an email are not extracted
> ---
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: nested.eml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
> * Subject: Test email within email
> ** Subject: Email within email test
> *** Subject: Stand-up today
>
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) ",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) ",
>         "Message-To": [
>             "fm.SAN Management Team ",
>             "Smith Van der, H (Henry) "
>         ],
>         "Message:From-Email": "henry.van.der.sm...@bank.com",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535296#comment-16535296 ]

Yury Kats commented on TIKA-2685:
-
Yes, correct, this is governed by RFC 3462; sorry I didn't mention this upfront.

> Email attached to an undeliverable email report are not extracted
> -
>
> Key: TIKA-2685
> URL: https://issues.apache.org/jira/browse/TIKA-2685
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: undeliverable.eml
>
>
> I have a number of email messages that are reports of undeliverable emails that contain the original email message as attachment.
> The original emails are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email:
> * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
> ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
>
> I would like to see 2 separate emails parsed out (top level undeliverable report, 1st level attached original email), but I get 1 email and 2 unnamed text attachments:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m json.tool
> [
>     {
>         "Author": "postmas...@bank.com",
>         "Content-Length": "17356",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2017-11-04T16:00:11Z",
>         "Message-From": "postmas...@bank.com",
>         "Message-To": "uatalert...@logscape.com",
>         "Message:From-Email": "postmas...@bank.com",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "326",
>         "creator": "postmas...@bank.com",
>         "dc:creator": "postmas...@bank.com",
>         "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
>         "dcterms:created": "2017-11-04T16:00:11Z",
>         "meta:author": "postmas...@bank.com",
>         "meta:creation-date": "2017-11-04T16:00:11Z",
>         "resourceName": "undeliverable.eml",
>         "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
>     },
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Type": "text/plain; charset=windows-1252",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "4",
>         "embeddedResourceType": "ATTACHMENT"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/html; charset=US-ASCII",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.html.HtmlParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-2",
>         "X-TIKA:parse_time_millis": "7",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535288#comment-16535288 ] Tim Allison commented on TIKA-2685: --- https://tools.ietf.org/html/rfc3462 page 2 describes exactly this...yay! With 1) and 2) both [Required] and 3) the original rfc822 being [Optional]
[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271 ]

Yury Kats edited comment on TIKA-2685 at 7/6/18 8:03 PM:
-
delivery-status and message/rfc822 are inside multipart/report
{noformat}
multipart/report
  multipart/alternative
    text/plain
    text/html
    end
  message/delivery-status
  message/rfc822
  end
{noformat}

was (Author: yurykats):
delivery-status and message/rfc822 are inside multipart/report
{noformat}
multipart/report
  multipart/alternative
    text/plain
    text/html
    end
  message/delivery-status
  message/rfc822
  end
{noformat}
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535275#comment-16535275 ]

Tim Allison commented on TIKA-2685:
---
I think I agree...the first rfc822 (multipart/report) has three parts:
1. multipart/alternative
2. message/delivery-status
3. message/rfc822 (the original eml)
[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271 ]

Yury Kats edited comment on TIKA-2685 at 7/6/18 8:02 PM:
-
delivery-status and message/rfc822 are inside multipart/report
{noformat}
multipart/report
  multipart/alternative
    text/plain
    text/html
    end
  message/delivery-status
  message/rfc822
  end
{noformat}

was (Author: yurykats):
delivery-status and message/rfc822 are inside multipart/report
[jira] [Comment Edited] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535267#comment-16535267 ]

Tim Allison edited comment on TIKA-2685 at 7/6/18 8:00 PM:
---
Is this your understanding of the structure?
{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain/ (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (multipart/report::report-type=delivery-status) (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_) (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_) (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)
    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
  end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}

was (Author: talli...@mitre.org):
Is this your understanding of the structure?
{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain/ (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_) (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_) (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)
    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
  end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}
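For readers who want to poke at the layout discussed in this comment, here is a minimal sketch using Python's standard `email` package (not Tika's RFC822Parser). It builds a toy multipart/report with the three parts named in the thread (multipart/alternative, message/delivery-status, message/rfc822) and prints the nesting. The helper names `build_report`, `structure` and `embedded_emails`, and all message contents, are made up for illustration.

```python
from email.mime.base import MIMEBase
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report():
    """Build a toy multipart/report with the three parts listed in the thread."""
    # Payload of part 3: the original email (contents are invented).
    original = MIMEText("The disk is full.", "plain")
    original["Subject"] = "SRE Agent Out of Space"

    # Part 1: human-readable explanation, as plain-text and HTML alternatives.
    human = MIMEMultipart("alternative")
    human.attach(MIMEText("Delivery has failed.", "plain"))
    human.attach(MIMEText("<p>Delivery has failed.</p>", "html"))

    # Part 2: machine-readable status, kept as an opaque payload in this sketch.
    status = MIMEBase("message", "delivery-status")
    status.set_payload("Action: failed\nStatus: 5.2.2\n")

    # report_type becomes the "report-type" Content-Type parameter.
    report = MIMEMultipart("report", report_type="delivery-status")
    report.attach(human)
    report.attach(status)
    report.attach(MIMEMessage(original))  # part 3: message/rfc822
    return report


def structure(part, depth=0):
    """Return one indented content-type line per MIME part, depth first."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        for child in part.get_payload():
            lines.extend(structure(child, depth + 1))
    return lines


def embedded_emails(part):
    """Collect the messages carried inside message/rfc822 parts."""
    return [p.get_payload(0) for p in part.walk()
            if p.get_content_type() == "message/rfc822"]


report = build_report()
print("\n".join(structure(report)))
print([m["Subject"] for m in embedded_emails(report)])
```

The point of the sketch is only that the original email sits one level below multipart/report as a message/rfc822 part, which is the part the issue reports as missing from the Tika output.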
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535271#comment-16535271 ] Yury Kats commented on TIKA-2685: - delivery-status and message/rfc822 are inside multipart/report
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535267#comment-16535267 ]

Tim Allison commented on TIKA-2685:
---
Is this your understanding of the structure?
{noformat}
multipart/mixed (uses: _5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  text/plain/ (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
  message/rfc822 (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_) (uses: _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    multipart/alternative (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_) (uses: _a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/plain (_a9a8211e-47bf-4c8f-904b-a82970f79951_)
      text/html (end_a9a8211e-47bf-4c8f-904b-a82970f79951_)
    message/delivery-status (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
    message/rfc822 (_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
      text/html (end _dd8c2c7d-5333-4f9a-a282-d2056075e7aa_)
  end (_5a8d7320-7cd6-4c1b-8e30-9616634562b2_)
{noformat}
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535147#comment-16535147 ] Hudson commented on TIKA-2673: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #56 (See [https://builds.apache.org/job/tika-branch-1x/56/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: [https://github.com/apache/tika/commit/525889a4f928d1d448c6aaf6b1ddc19081e07404]) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java > HtmlEncodingDetector doesn't follow the specification > - > > Key: TIKA-2673 > URL: https://issues.apache.org/jira/browse/TIKA-2673 > Project: Tika > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Attachments: HtmlEncodingDetectorTest.java, > StrictHtmlEncodingDetector.tar.gz > > > This bug is linked to TIKA-2671, but does not concern metadata, but rather > the bytes-based detection itself. > While reading the specification, I collected a list of sample cases where > HtmlEncodingDetector differs from the specification, and thus fails at > detecting the right charset. > I am attaching the test cases to this issue: -- This message was sent by Atlassian JIRA (v7.6.3#76005)
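The commit above ships a whatwg-encoding-labels.tsv resource alongside StrictHtmlEncodingDetector. The thread does not show that file's layout or how the detector reads it, so the following is only a minimal sketch of the general idea of a label-to-encoding lookup as the WHATWG Encoding Standard describes it: surrounding ASCII whitespace is stripped, matching is case-insensitive, and several legacy labels map to windows-1252. The class name, method names, and the tiny hand-picked table are hypothetical and are not taken from Tika.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: NOT Tika's StrictHtmlEncodingDetector.
final class EncodingLabelSketch {

    // Hand-picked subset of label -> encoding pairs; the real table is far larger.
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
        LABELS.put("latin1", "windows-1252");
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("us-ascii", "windows-1252");
        LABELS.put("windows-1252", "windows-1252");
    }

    private static boolean isAsciiWhitespace(char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Removes leading and trailing ASCII whitespace only.
    static String stripAsciiWhitespace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isAsciiWhitespace(s.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    // Looks up a label; Locale.ROOT lower-casing approximates ASCII case-insensitive matching.
    static Optional<String> encodingForLabel(String label) {
        String key = stripAsciiWhitespace(label).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(LABELS.get(key));
    }

    public static void main(String[] args) {
        System.out.println(encodingForLabel(" Latin1 "));
        System.out.println(encodingForLabel("ISO-8859-1"));
        System.out.println(encodingForLabel("bogus"));
    }
}
```

The point of the sketch is that a page declaring "ISO-8859-1" is decoded as windows-1252 under the specification, which is one of the ways a detector that takes the declared label literally can disagree with it.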
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535145#comment-16535145 ] Hudson commented on TIKA-2673: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1517 (See [https://builds.apache.org/job/Tika-trunk/1517/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: [https://github.com/apache/tika/commit/790c1248207371e6cb2a3e7a1ec3a021503ec7a4]) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java > HtmlEncodingDetector doesn't follow the specification > - > > Key: TIKA-2673 > URL: https://issues.apache.org/jira/browse/TIKA-2673 > Project: Tika > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Attachments: HtmlEncodingDetectorTest.java, > StrictHtmlEncodingDetector.tar.gz > > > This bug is linked to TIKA-2671, but does not concern metadata, but rather > the bytes-based detection itself. > While reading the specification, I collected a list of sample cases where > HtmlEncodingDetector differs from the specification, and thus fails at > detecting the right charset. > I am attaching the test cases to this issue: -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535135#comment-16535135 ] Yury Kats commented on TIKA-2685: - For my own immediate needs, I modified MimeStreamParser to call ContentHandler#startMessage with the stream, i.e., instead of
{code}
handler.startMessage()
{code}
I call
{code}
handler.startMessage(mimeTokenStream.getInputStream())
{code}
I then modified the MailContentHandler to have a startMessage(InputStream is) method where I check that it's inside of "multipart/report" and then invoke handleEmbedded
{code}
public void startMessage(InputStream is) throws MimeException {
    boolean attachedMessage = parts.size() > 0
            && parts.peek().getMimeType().equals("multipart/report");
    if (!attachedMessage) {
        startMessage();
    } else {
        Metadata submd = new Metadata();
        submd.set(Metadata.CONTENT_TYPE, "message/rfc822");
        submd.set(Metadata.CONTENT_DISPOSITION, "attachment");
        try (TikaInputStream tis = TikaInputStream.get(is)) {
            handleEmbedded(tis, submd);
        } catch (IOException e) {
            throw new MimeException(e);
        }
    }
}
{code}
I've made a similar change for TIKA-2680, only there I detect that the message is an attachment in MimeEntity#advance, and then end up calling the new startMessage with the stream and then handleEmbedded. Not sure if these are the best ways of solving these issues though. Looking forward to your take on them.
> Email attached to an undeliverable email report are not extracted > - > > Key: TIKA-2685 > URL: https://issues.apache.org/jira/browse/TIKA-2685 > Project: Tika > Issue Type: Bug >Affects Versions: 1.18 >Reporter: Yury Kats >Assignee: Tim Allison >Priority: Major > Attachments: undeliverable.eml > > > I have a number of email messages that are reports of deliverable emails that > contain the original email message as attachment. > The original emails are parts with "Content-Type: message/rfc822" but are not > being recognized as such. 
> Attached is an example email: > * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp > ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp > > I would like to see 2 separate emails parsed out (top level undeliverable > report, 1st level attached original email), but I get 1 email and 2 unnamed > text attachments: > {noformat} > $ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m > json.tool > [ > { > "Author": "postmas...@bank.com", > "Content-Length": "17356", > "Content-Type": "message/rfc822", > "Creation-Date": "2017-11-04T16:00:11Z", > "Message-From": "postmas...@bank.com", > "Message-To": "uatalert...@logscape.com", > "Message:From-Email": "postmas...@bank.com", > "Message:Raw-Header:Auto-Submitted": "auto-generated", > "Message:Raw-Header:MIME-Version": "1.0", > "Message:Raw-Header:Message-ID": > "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>", > "Message:Raw-Header:Return-Path": "<>", > "Message:Raw-Header:Sender": > "", > "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal > Agent", > "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "", > "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": > "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>", > "Message:Raw-Header:X-MS-Journal-Report": "", > "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_", > "Multipart-Subtype": "mixed", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.mail.RFC822Parser" > ], > "X-TIKA:parse_time_millis": "326", > "creator": "postmas...@bank.com", > "dc:creator": "postmas...@bank.com", > "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp", > "dcterms:created": "2017-11-04T16:00:11Z", > "meta:author": "postmas...@bank.com", > "meta:creation-date": "2017-11-04T16:00:11Z", > "resourceName": "undeliverable.eml", > "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp" > }, > { > "Content-Encoding": 
"windows-1252", > "Content-Type": "text/plain; charset=windows-1252", > "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_", > "Multipart-Subtype": "report", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.txt.TXTParser" > ], > "X-TIKA:embedded_resource_path":
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535130#comment-16535130 ] Hudson commented on TIKA-2673: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #282 (See [https://builds.apache.org/job/tika-2.x-windows/282/]) TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard (tallison: rev 790c1248207371e6cb2a3e7a1ec3a021503ec7a4) * (add) tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv * (add) tika-parsers/src/test/java/org/apache/tika/parser/html/StrictHtmlEncodingDetectorTest.java > HtmlEncodingDetector doesn't follow the specification > - > > Key: TIKA-2673 > URL: https://issues.apache.org/jira/browse/TIKA-2673 > Project: Tika > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Attachments: HtmlEncodingDetectorTest.java, > StrictHtmlEncodingDetector.tar.gz > > > This bug is linked to TIKA-2671, but does not concern metadata, but rather > the bytes-based detection itself. > While reading the specification, I collected a list of sample cases where > HtmlEncodingDetector differs from the specification, and thus fails at > detecting the right charset. > I am attaching the test cases to this issue: -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535100#comment-16535100 ] Tim Allison commented on TIKA-2685: --- [~yurykats], thank you for identifying this problem and TIKA-2680 and sharing them with us! I have a couple of other things on my plate, but I _hope_ to turn to these fairly soon (next week or so?). Your example files and diagnoses are very helpful. Thank you! > Email attached to an undeliverable email report are not extracted > - > > Key: TIKA-2685 > URL: https://issues.apache.org/jira/browse/TIKA-2685 > Project: Tika > Issue Type: Bug >Affects Versions: 1.18 >Reporter: Yury Kats >Priority: Major > Attachments: undeliverable.eml > > > I have a number of email messages that are reports of deliverable emails that > contain the original email message as attachment. > The original emails are parts with "Content-Type: message/rfc822" but are not > being recognized as such. > Attached is an example email: > * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp > ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp > > I would like to see 2 separate emails parsed out (top level undeliverable > report, 1st level attached original email), but I get 1 email and 2 unnamed > text attachments: > {noformat} > $ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m > json.tool > [ > { > "Author": "postmas...@bank.com", > "Content-Length": "17356", > "Content-Type": "message/rfc822", > "Creation-Date": "2017-11-04T16:00:11Z", > "Message-From": "postmas...@bank.com", > "Message-To": "uatalert...@logscape.com", > "Message:From-Email": "postmas...@bank.com", > "Message:Raw-Header:Auto-Submitted": "auto-generated", > "Message:Raw-Header:MIME-Version": "1.0", > "Message:Raw-Header:Message-ID": > "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>", > "Message:Raw-Header:Return-Path": "<>", > "Message:Raw-Header:Sender": > "", > 
"Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal > Agent", > "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "", > "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": > "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>", > "Message:Raw-Header:X-MS-Journal-Report": "", > "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_", > "Multipart-Subtype": "mixed", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.mail.RFC822Parser" > ], > "X-TIKA:parse_time_millis": "326", > "creator": "postmas...@bank.com", > "dc:creator": "postmas...@bank.com", > "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp", > "dcterms:created": "2017-11-04T16:00:11Z", > "meta:author": "postmas...@bank.com", > "meta:creation-date": "2017-11-04T16:00:11Z", > "resourceName": "undeliverable.eml", > "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp" > }, > { > "Content-Encoding": "windows-1252", > "Content-Type": "text/plain; charset=windows-1252", > "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_", > "Multipart-Subtype": "report", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.txt.TXTParser" > ], > "X-TIKA:embedded_resource_path": "/embedded-1", > "X-TIKA:parse_time_millis": "4", > "embeddedResourceType": "ATTACHMENT" > }, > { > "Content-Encoding": "US-ASCII", > "Content-Type": "text/html; charset=US-ASCII", > "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_", > "Multipart-Subtype": "report", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.html.HtmlParser" > ], > "X-TIKA:embedded_resource_path": "/embedded-2", > "X-TIKA:parse_time_millis": "7", > "embeddedResourceType": "ATTACHMENT" > } > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (TIKA-2685) Email attached to an undeliverable email report are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-2685: - Assignee: Tim Allison > Email attached to an undeliverable email report are not extracted > - > > Key: TIKA-2685 > URL: https://issues.apache.org/jira/browse/TIKA-2685 > Project: Tika > Issue Type: Bug >Affects Versions: 1.18 >Reporter: Yury Kats >Assignee: Tim Allison >Priority: Major > Attachments: undeliverable.eml > > > I have a number of email messages that are reports of deliverable emails that > contain the original email message as attachment. > The original emails are parts with "Content-Type: message/rfc822" but are not > being recognized as such. > Attached is an example email: > * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp > ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp > > I would like to see 2 separate emails parsed out (top level undeliverable > report, 1st level attached original email), but I get 1 email and 2 unnamed > text attachments: > {noformat} > $ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m > json.tool > [ > { > "Author": "postmas...@bank.com", > "Content-Length": "17356", > "Content-Type": "message/rfc822", > "Creation-Date": "2017-11-04T16:00:11Z", > "Message-From": "postmas...@bank.com", > "Message-To": "uatalert...@logscape.com", > "Message:From-Email": "postmas...@bank.com", > "Message:Raw-Header:Auto-Submitted": "auto-generated", > "Message:Raw-Header:MIME-Version": "1.0", > "Message:Raw-Header:Message-ID": > "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>", > "Message:Raw-Header:Return-Path": "<>", > "Message:Raw-Header:Sender": > "", > "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal > Agent", > "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "", > "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": > "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>", > 
"Message:Raw-Header:X-MS-Journal-Report": "", > "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_", > "Multipart-Subtype": "mixed", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.mail.RFC822Parser" > ], > "X-TIKA:parse_time_millis": "326", > "creator": "postmas...@bank.com", > "dc:creator": "postmas...@bank.com", > "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp", > "dcterms:created": "2017-11-04T16:00:11Z", > "meta:author": "postmas...@bank.com", > "meta:creation-date": "2017-11-04T16:00:11Z", > "resourceName": "undeliverable.eml", > "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp" > }, > { > "Content-Encoding": "windows-1252", > "Content-Type": "text/plain; charset=windows-1252", > "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_", > "Multipart-Subtype": "report", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.txt.TXTParser" > ], > "X-TIKA:embedded_resource_path": "/embedded-1", > "X-TIKA:parse_time_millis": "4", > "embeddedResourceType": "ATTACHMENT" > }, > { > "Content-Encoding": "US-ASCII", > "Content-Type": "text/html; charset=US-ASCII", > "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_", > "Multipart-Subtype": "report", > "X-Parsed-By": [ > "org.apache.tika.parser.DefaultParser", > "org.apache.tika.parser.html.HtmlParser" > ], > "X-TIKA:embedded_resource_path": "/embedded-2", > "X-TIKA:parse_time_millis": "7", > "embeddedResourceType": "ATTACHMENT" > } > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535041#comment-16535041 ] Tim Allison commented on TIKA-2673: --- I've added this to both 'master' and 'branch_1x'. Let me know if you disagree with this or would like to make modifications. > HtmlEncodingDetector doesn't follow the specification > - > > Key: TIKA-2673 > URL: https://issues.apache.org/jira/browse/TIKA-2673 > Project: Tika > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Attachments: HtmlEncodingDetectorTest.java, > StrictHtmlEncodingDetector.tar.gz > > > This bug is linked to TIKA-2671, but does not concern metadata, but rather > the bytes-based detection itself. > While reading the specification, I collected a list of sample cases where > HtmlEncodingDetector differs from the specification, and thus fails at > detecting the right charset. > I am attaching the test cases to this issue: -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Tika 1.19?
Once tika-dl works again with Inception v4, I’m good ☺ I’m working on adding some more models to tika-dl and other things but those can come after 1.19. Cheers, Chris From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Friday, July 6, 2018 at 8:40 AM To: "dev@tika.apache.org" Subject: Tika 1.19? All, We've made quite a few improvements, what would you think of starting the release process in a couple of weeks...say, July 23ish? I'd like to complete the dl4j upgrade and update some of our dependencies so that we can at least build with Java 11. Any blockers or other things people want to get in? Cheers, Tim
Tika 1.19?
All, We've made quite a few improvements, what would you think of starting the release process in a couple of weeks...say, July 23ish? I'd like to complete the dl4j upgrade and update some of our dependencies so that we can at least build with Java 11. Any blockers or other things people want to get in? Cheers, Tim
[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta
[ https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534994#comment-16534994 ] Tim Allison commented on TIKA-2672: --- Fantastic! Thank you [~ThejanWijesinghe]! > Upgrade dl4j to 1.0.0-beta > -- > > Key: TIKA-2672 > URL: https://issues.apache.org/jira/browse/TIKA-2672 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: TIKA-2672.patch > > > Let's try to upgrade dl4j. I think I got us most of the way there, but I got > this error when reading the json config file. Can someone with more > knowledge of layer specs help ([~thammegowda], perhaps :))? > {noformat} > org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid > configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) > for width dimension: Invalid input configuration for kernel width. Require 0 > < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0) > Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides > = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution > mode = Truncate > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
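The exception quoted in this issue states its own constraint, 0 < kW <= inWidth + 2*padW, and reports kW=3, inWidth=1, padW=0. As a small worked check of that arithmetic, here is a sketch that evaluates the constraint for both dimensions of the reported input type (h=149, w=1). The output-size formula, floor((in - k + 2p) / s) + 1, is the usual one for a convolution without implicit padding; that DL4J's Truncate mode uses exactly this formula is an assumption here, and the class and method names are hypothetical rather than DL4J API.

```java
// Hypothetical sketch: plain arithmetic, no DL4J classes involved.
final class ConvShapeCheckSketch {

    // The constraint quoted in the exception: 0 < kernel <= in + 2 * pad.
    static boolean kernelFits(int kernel, int in, int pad) {
        return kernel > 0 && kernel <= in + 2 * pad;
    }

    // Standard output size for an unpadded-style convolution: floor((in - k + 2p) / s) + 1.
    static int outputSize(int in, int kernel, int stride, int pad) {
        if (!kernelFits(kernel, in, pad)) {
            throw new IllegalArgumentException(
                    "Require 0 < k <= in + 2*pad; got (k=" + kernel
                            + ", in=" + in + ", pad=" + pad + ")");
        }
        // The numerator is non-negative after the check, so integer division is a floor.
        return (in - kernel + 2 * pad) / stride + 1;
    }

    public static void main(String[] args) {
        // Height dimension from the log: h=149, kernel 3, stride 1, padding 0.
        System.out.println("height fits: " + kernelFits(3, 149, 0));
        System.out.println("height out:  " + outputSize(149, 3, 1, 0));
        // Width dimension from the log: w=1, kernel 3, padding 0.
        System.out.println("width fits:  " + kernelFits(3, 1, 0));
    }
}
```

With the logged values the height passes (149 - 3 + 1 = 147) while the width fails because 3 > 1 + 0. The sketch only confirms why the check fires; it says nothing about why the imported model ends up with a width of 1 at that layer.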
[jira] [Comment Edited] (TIKA-2672) Upgrade dl4j to 1.0.0-beta
[ https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534994#comment-16534994 ] Tim Allison edited comment on TIKA-2672 at 7/6/18 3:30 PM: --- Fantastic! Thank you [~ThejanWijesinghe]! And, yes, I spent some time tweaking some of the parameters, but with no luck. I think we need expert help. :) was (Author: talli...@mitre.org): Fantastic! Thank you [~ThejanWijesinghe]! > Upgrade dl4j to 1.0.0-beta > -- > > Key: TIKA-2672 > URL: https://issues.apache.org/jira/browse/TIKA-2672 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: TIKA-2672.patch > > > Let's try to upgrade dl4j. I think I got us most of the way there, but I got > this error when reading the json config file. Can someone with more > knowledge of layer specs help ([~thammegowda], perhaps :))? > {noformat} > org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid > configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) > for width dimension: Invalid input configuration for kernel width. Require 0 > < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0) > Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides > = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution > mode = Truncate > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification
[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534992#comment-16534992 ] Tim Allison commented on TIKA-2673: --- [~gbouchar], thank you for contributing this! I won't have time to run the regression tests any time soon. Would you be ok if I added your StrictHtmlEncodingDetector to Tika now? Users would then be able to configure Tika to use it via tika-config.xml. If you're ok with this, is it ok if I add the Apache Software License 2.0 headers to your main class, test class and .tsv files? Thank you, again! > HtmlEncodingDetector doesn't follow the specification > - > > Key: TIKA-2673 > URL: https://issues.apache.org/jira/browse/TIKA-2673 > Project: Tika > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Attachments: HtmlEncodingDetectorTest.java, > StrictHtmlEncodingDetector.tar.gz > > > This bug is linked to TIKA-2671, but does not concern metadata, but rather > the bytes-based detection itself. > While reading the specification, I collected a list of sample cases where > HtmlEncodingDetector differs from the specification, and thus fails at > detecting the right charset. > I am attaching the test cases to this issue: -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files
[ https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534990#comment-16534990 ] Hudson commented on TIKA-2675: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #55 (See [https://builds.apache.org/job/tika-branch-1x/55/]) TIKA-2675 OpenDocumentParser should fail on invalid zip files - throw (snagel: [https://github.com/apache/tika/commit/def58f6e84de61031212ae40714e02f05d3c19fc]) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java * (add) tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt > OpenDocumentParser should fail on invalid zip files > --- > > Key: TIKA-2675 > URL: https://issues.apache.org/jira/browse/TIKA-2675 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.19, 2.0.0 > > > The OpenDocumentParser assumes a zip file as container. However, if it is > called on an invalid zip stream from a remote URL (see NUTCH-2603), the > parser signals success and returns a document with no/empty content. The > behavior is different when called on a local file: while the [constructor of > ZipFile|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipFile.html#ZipFile-java.io.File-] > fails on invalid input, the [constructor of > ZipInputStream|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipInputStream.html#ZipInputStream-java.io.InputStream-] > silently ignores the input. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
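The difference described in this issue between the two JDK entry points can be reproduced with the standard library alone. The following is a minimal sketch, not Tika code: it feeds the same non-zip bytes (the content of the test file added in the commit) to ZipInputStream and to ZipFile. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: demonstrates JDK behavior only, no Tika classes.
final class ZipValidationSketch {

    // Returns true if the stream yields at least one zip entry.
    static boolean streamHasEntries(byte[] data) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
            // For input that is not a zip, getNextEntry() returns null instead of throwing.
            return zis.getNextEntry() != null;
        }
    }

    // Returns true if ZipFile accepts the file, false if its constructor throws ZipException.
    static boolean fileOpensAsZip(Path path) throws IOException {
        try (ZipFile zf = new ZipFile(path.toFile())) {
            return true;
        } catch (ZipException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] junk = "This is not a zip file!".getBytes(StandardCharsets.UTF_8);
        Path tmp = Files.createTempFile("not-a-zip", ".odt");
        try {
            Files.write(tmp, junk);
            System.out.println("stream has entries: " + streamHasEntries(junk));
            System.out.println("file opens as zip:  " + fileOpensAsZip(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both calls report false, but for different reasons: the stream variant simply finds no entries, while the file variant fails loudly. A caller that only loops over entries therefore sees an "empty but successful" document in the stream case, which is the behavior the issue describes.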
[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta
[ https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534984#comment-16534984 ] Chris A. Mattmann commented on TIKA-2672: - GREAT WORK [~ThejanWijesinghe] thanks my guy > Upgrade dl4j to 1.0.0-beta > -- > > Key: TIKA-2672 > URL: https://issues.apache.org/jira/browse/TIKA-2672 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: TIKA-2672.patch > > > Let's try to upgrade dl4j. I think I got us most of the way there, but I got > this error when reading the json config file. Can someone with more > knowledge of layer specs help ([~thammegowda], perhaps :))? > {noformat} > org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid > configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) > for width dimension: Invalid input configuration for kernel width. Require 0 > < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0) > Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides > = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution > mode = Truncate > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2672) Upgrade dl4j to 1.0.0-beta
[ https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534966#comment-16534966 ] Thejan Wijesinghe commented on TIKA-2672: - [~talli...@apache.org] sorry for the delay. The dl4j update works fine for VGG, but not for Inception v3. I spent almost two days trying to fix this. First, I replaced the Keras 1 Inception v3 model with a Keras 2 Inception v3 model, but couldn't get that working. I then assumed the problem was related to the model's dynamic input shapes (None shapes), so I changed the entire model to static shapes, starting by changing the input layer from (None, None, 3) to (299, 299, 3), but still no luck. I didn't see any breaking changes in their API either, so I filed an issue today [https://github.com/deeplearning4j/deeplearning4j/issues/5831]. They have ignored the Inception v3 test in their own tests too. In any case, if they are aiming for full backward compatibility, the model import API should work with a Keras 1 model as well. Let's wait for their instructions. > Upgrade dl4j to 1.0.0-beta > -- > > Key: TIKA-2672 > URL: https://issues.apache.org/jira/browse/TIKA-2672 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: TIKA-2672.patch > > > Let's try to upgrade dl4j. I think I got us most of the way there, but I got > this error when reading the json config file. Can someone with more > knowledge of layer specs help ([~thammegowda], perhaps :))? > {noformat} > org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid > configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer) > for width dimension: Invalid input configuration for kernel width. 
Require 0 > < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0) > Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides > = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution > mode = Truncate > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files
[ https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534858#comment-16534858 ] Hudson commented on TIKA-2675: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1516 (See [https://builds.apache.org/job/Tika-trunk/1516/]) TIKA-2675 -- OpenDocumentParser should fail on invalid zip via Sebastian (tallison: [https://github.com/apache/tika/commit/c9a81a400ee10e9342bbfe718d62f0b0d6c7944f]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java * (add) tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt * (edit) tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java > OpenDocumentParser should fail on invalid zip files > --- > > Key: TIKA-2675 > URL: https://issues.apache.org/jira/browse/TIKA-2675 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.19, 2.0.0 > > > The OpenDocumentParser assumes a zip file as container. However, if it is > called on an invalid zip stream from a remote URL (see NUTCH-2603), the > parser signals success and returns a document with no/empty content. The > behavior is different when called on a local file: while the [constructor of > ZipFile|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipFile.html#ZipFile-java.io.File-] > fails on invalid input, the [constructor of > ZipInputStream|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipInputStream.html#ZipInputStream-java.io.InputStream-] > silently ignores the input. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2675) OpenDocumentParser should fail on invalid zip files
[ https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2675. --- Resolution: Fixed Fix Version/s: 2.0.0 1.19 Thank you [~wastl-nagel]! > OpenDocumentParser should fail on invalid zip files > --- > > Key: TIKA-2675 > URL: https://issues.apache.org/jira/browse/TIKA-2675 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > Fix For: 1.19, 2.0.0 > > > The OpenDocumentParser assumes a zip file as container. However, if it is > called on an invalid zip stream from a remote URL (see NUTCH-2603), the > parser signals success and returns a document with no/empty content. The > behavior is different when called on a local file: while the [constructor of > ZipFile|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipFile.html#ZipFile-java.io.File-] > fails on invalid input, the [constructor of > ZipInputStream|https://docs.oracle.com/javase/8/docs/api/java/util/zip/ZipInputStream.html#ZipInputStream-java.io.InputStream-] > silently ignores the input. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
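The pattern behind this fix, as it appears in the merged pull request, is to treat "no first entry" as an error and only then iterate. Here is a minimal, self-contained sketch of that guard using only the JDK; the exception message mirrors the one in the patch, but the class, the helper names, and the in-memory zip used for the demonstration are hypothetical and are not Tika's OpenDocumentParser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of the "fail if there is no first entry" guard.
final class ZipEntryGuardSketch {

    // Lists entry names, throwing IOException if the stream contains no entries at all.
    static List<String> listEntriesOrFail(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new IOException("No entries found in ZipInputStream");
            }
            do {
                names.add(entry.getName());
                entry = zis.getNextEntry();
            } while (entry != null);
        }
        return names;
    }

    // Builds a one-entry zip in memory so the sketch needs no files on disk.
    static byte[] singleEntryZip(String name, String content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = singleEntryZip("content.xml", "<doc/>");
        System.out.println(listEntriesOrFail(new ByteArrayInputStream(zip)));
        byte[] junk = "This is not a zip file!".getBytes(StandardCharsets.UTF_8);
        try {
            listEntriesOrFail(new ByteArrayInputStream(junk));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One consequence of this design is that a structurally valid zip with zero entries is rejected as well, since the stream API cannot distinguish it from non-zip input at this point.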
[jira] [Commented] (TIKA-2675) OpenDocumentParser should fail on invalid zip files
[ https://issues.apache.org/jira/browse/TIKA-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534789#comment-16534789 ] ASF GitHub Bot commented on TIKA-2675: -- tballison closed pull request #240: TIKA-2675 OpenDocumentParser should fail on invalid zip files URL: https://github.com/apache/tika/pull/240 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
index e7eb76ab8..7ed77fbc1 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
@@ -26,6 +26,7 @@ import java.util.HashSet;
 import java.util.Set;
 import java.util.zip.ZipEntry;
+import java.util.zip.ZipException;
 import java.util.zip.ZipFile;
 import java.util.zip.ZipInputStream;
 
@@ -175,10 +176,13 @@ public void parse(
     private void handleZipStream(ZipInputStream zipStream, Metadata metadata, ParseContext context, EndDocumentShieldingContentHandler handler) throws IOException, TikaException, SAXException {
         ZipEntry entry = zipStream.getNextEntry();
-        while (entry != null) {
+        if (entry == null) {
+            throw new IOException("No entries found in ZipInputStream");
+        }
+        do {
             handleZipEntry(entry, zipStream, metadata, context, handler);
             entry = zipStream.getNextEntry();
-        }
+        } while (entry != null);
     }
 
     private void handleZipFile(ZipFile zipFile, Metadata metadata,
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
index d1ec46e94..ba6c0ca11 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
@@ -19,6 +19,7 @@ import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertTrue;
 
+import java.io.IOException;
 import java.io.InputStream;
 import java.util.List;
 
@@ -406,6 +407,28 @@ public void testEmbedded() throws Exception {
         assertEquals(3, metadataList.size());
     }
 
+    @Test(expected = IOException.class)
+    public void testInvalidFromStream() throws Exception {
+        try (InputStream is = this.getClass().getResource(
+                "/test-documents/testODTnotaZipFile.odt").openStream()) {
+            OpenDocumentParser parser = new OpenDocumentParser();
+            Metadata metadata = new Metadata();
+            ContentHandler handler = new BodyContentHandler();
+            parser.parse(is, handler, metadata, new ParseContext());
+        }
+    }
+
+    @Test(expected = IOException.class)
+    public void testInvalidFromFile() throws Exception {
+        try (TikaInputStream tis = TikaInputStream.get(this.getClass().getResource(
+                "/test-documents/testODTnotaZipFile.odt"))) {
+            OpenDocumentParser parser = new OpenDocumentParser();
+            Metadata metadata = new Metadata();
+            ContentHandler handler = new BodyContentHandler();
+            parser.parse(tis, handler, metadata, new ParseContext());
+        }
+    }
+
     private ParseContext getNonRecursingParseContext() {
         ParseContext parseContext = new ParseContext();
         parseContext.set(Parser.class, new EmptyParser());
diff --git a/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt b/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt
new file mode 100644
index 0..9c1d376f0
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testODTnotaZipFile.odt
@@ -0,0 +1 @@
+This is not a zip file!
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > OpenDocumentParser should fail on invalid zip files > --- > > Key: TIKA-2675 > URL: https://issues.apache.org/jira/browse/TIKA-2675 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Major > > The OpenDocumentParser assumes a zip file as container. However, if it
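For readers who want to see the behaviour the patch relies on outside of Tika, here is a minimal standalone sketch. It assumes only the JDK's java.util.zip classes; the class name ZipEntryCheck and the helpers countEntries and zipOf are hypothetical and are not part of Tika. It mirrors the patched handleZipStream logic: ZipInputStream.getNextEntry() returns null straight away when the input is not a zip, so the first null is turned into an IOException instead of being silently treated as an empty document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryCheck {

    /** Counts entries, failing fast when the stream yields none (for example, when it is not a zip). */
    static int countEntries(InputStream in) throws IOException {
        try (ZipInputStream zipStream = new ZipInputStream(in)) {
            ZipEntry entry = zipStream.getNextEntry();
            if (entry == null) {
                // Same message as in the patch; a plain while loop would simply skip its body here.
                throw new IOException("No entries found in ZipInputStream");
            }
            int count = 0;
            do {
                count++;
                entry = zipStream.getNextEntry();
            } while (entry != null);
            return count;
        }
    }

    /** Hypothetical helper: builds a small in-memory zip with one entry per name. */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write(name.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] valid = zipOf("mimetype", "content.xml");
        System.out.println("entries: " + countEntries(new ByteArrayInputStream(valid)));

        byte[] notZip = "This is not a zip file!".getBytes(StandardCharsets.UTF_8);
        try {
            countEntries(new ByteArrayInputStream(notZip));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The do/while form only works because the null check comes first; without it the loop body would run once with a null entry.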
[jira] [Commented] (TIKA-874) Identify FITS (Flexible Image Transport System) files
[ https://issues.apache.org/jira/browse/TIKA-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534784#comment-16534784 ]

Tim Allison commented on TIKA-874:
----------------------------------

See TIKA-2684 for how to configure GDAL to parse FITS...many thanks to [~sborda]!

> Identify FITS (Flexible Image Transport System) files
> -----------------------------------------------------
>
> Key: TIKA-874
> URL: https://issues.apache.org/jira/browse/TIKA-874
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Peter May
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.2
>
> Attachments: fits_support.patch
>
>
> Tika does not have a defined signature for application/fits files. I have
> created a patch (based on file(1) magic) to address identification of such
> files, including a simple unit test.
> This patch only handles identification, not parsing of FITS files.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
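As a rough illustration of what signature-based identification means here, the sketch below checks the leading bytes of a stream. The contents of fits_support.patch are not shown in this thread, so the magic value is an assumption: it relies only on the FITS rule that the primary header begins with the SIMPLE keyword in columns 1-8 followed by the value indicator "= ". The class name FitsMagicSketch and the method looksLikeFits are hypothetical and are not Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FitsMagicSketch {

    // Assumed signature: keyword "SIMPLE" padded to 8 columns, then "= ".
    private static final byte[] MAGIC = "SIMPLE  = ".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the stream starts with the assumed FITS signature. */
    static boolean looksLikeFits(InputStream in) throws IOException {
        byte[] prefix = new byte[MAGIC.length];
        int read = 0;
        while (read < prefix.length) {
            int n = in.read(prefix, read, prefix.length - read);
            if (n < 0) {
                return false; // stream is shorter than the signature
            }
            read += n;
        }
        return Arrays.equals(prefix, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] fitsLike = "SIMPLE  =                    T".getBytes(StandardCharsets.US_ASCII);
        byte[] other = "This is not a zip file!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeFits(new ByteArrayInputStream(fitsLike)));
        System.out.println(looksLikeFits(new ByteArrayInputStream(other)));
    }
}
```

As the issue text says, this kind of check only identifies a file as FITS; it does not parse anything.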
[jira] [Resolved] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata
[ https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2684.
------------------------------
    Resolution: Not A Problem

Not a Tika problem technically, but definitely an area for us to improve our documentation. Thank you [~sborda]!

> Tika does not extract *.fits header text, just file level metadata
> ------------------------------------------------------------------
>
> Key: TIKA-2684
> URL: https://issues.apache.org/jira/browse/TIKA-2684
> Project: Tika
> Issue Type: Improvement
> Components: metadata, mime, parser
> Affects Versions: 1.18
> Reporter: Susan
> Priority: Minor
>
> Tika only pulls file-level metadata for *.fits (Flexible Image Transport
> System) files, using:
> java -jar tika-app-1.18.jar --gui
> Content-Length: 699840
> Content-Type: application/fits
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.gdal.GDALParser
> X-TIKA:digest:MD5: d93e8f4654902c45c7f3e4f4bf5f63e2
> X-TIKA:digest:SHA256: da7c0f1b6643850856cba100e9b3e8db76b80e91583eb088635c416a2b4161b3
> resourceName: WFPC2u5780205r_c0fx.fits
> Rather than text from the header (extracted with astropy.py):
> SIMPLE  = T / file does conform to FITS standard
> BITPIX  = -32 / number of bits per data pixel
> NAXIS   = 3 / number of data axes
> NAXIS1  = 200 / length of data axis 1
> NAXIS2  = 200 / length of data axis 2
> NAXIS3  = 4 / length of data axis 3
> EXTEND  = T / FITS dataset may contain extensions
> COMMENT FITS (Flexible Image Transport System) format is defined in 'Astronomy
> COMMENT and Astrophysics', volume 376, page 359; bibcode: 2001A
> BSCALE  = 1.0E0 / REAL = TAPE*BSCALE + BZERO
> BZERO   = 0.0E0 /
> OPSIZE  = 2112 / PSIZE of original image
> ORIGIN  = 'STScI-STSDAS' / Fitsio version 21-Feb-1996
> FITSDATE= '2004-01-09' / Date FITS file was created
> FILENAME= 'u5780205r_cvt.c0h' / Original filename
> ALLG-MAX= 3.01E3 / Data max in all groups
> ALLG-MIN= -7.319537E1 / Data min in all groups
> ODATTYPE= 'FLOATING' / Original datatype: Single precision real
> SDASMGNU= 4 / Number of groups in original image
>
> This capability was mentioned in TIKA-874. I'm looking at netCDF
> files/headers as a model for this behaviour.
> Thank you!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
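To make concrete what "header text" refers to in this issue, here is a minimal sketch of reading the keyword/value cards of a FITS primary header directly. It is not how Tika or GDAL handle FITS (the resolution above was a GDAL configuration matter). It assumes the standard FITS layout of 80-byte ASCII cards, with the keyword in columns 1-8, "= " in columns 9-10, an optional "/ comment" after the value, and an END card closing the header. The names FitsHeaderSketch, readPrimaryHeader, valueOf and card are hypothetical. Known limitations of the sketch: COMMENT and other value-less cards are skipped, and doubled quotes inside string values are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FitsHeaderSketch {

    static final int CARD_LENGTH = 80;

    /** Reads keyword/value pairs until the END card; throws EOFException if END never appears. */
    static Map<String, String> readPrimaryHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        Map<String, String> cards = new LinkedHashMap<>();
        byte[] card = new byte[CARD_LENGTH];
        while (true) {
            data.readFully(card);
            String text = new String(card, StandardCharsets.US_ASCII);
            String keyword = text.substring(0, 8).trim();
            if (keyword.equals("END")) {
                return cards;
            }
            if (text.startsWith("= ", 8)) { // value indicator in columns 9-10
                cards.put(keyword, valueOf(text.substring(10)));
            }
        }
    }

    /** Extracts the value part of a card, dropping any trailing "/ comment". */
    static String valueOf(String rest) {
        String trimmed = rest.trim();
        if (trimmed.startsWith("'")) {
            int close = trimmed.indexOf('\'', 1);
            return close > 0 ? trimmed.substring(0, close + 1) : trimmed;
        }
        int slash = trimmed.indexOf('/');
        return (slash >= 0 ? trimmed.substring(0, slash) : trimmed).trim();
    }

    /** Hypothetical helper: pads a line to one 80-column card. */
    static String card(String line) {
        return String.format("%-" + CARD_LENGTH + "s", line);
    }

    public static void main(String[] args) throws IOException {
        String header = card("SIMPLE  =                    T / file does conform to FITS standard")
                + card("BITPIX  =                  -32 / number of bits per data pixel")
                + card("ORIGIN  = 'STScI-STSDAS'       / Fitsio version 21-Feb-1996")
                + card("END");
        byte[] bytes = header.getBytes(StandardCharsets.US_ASCII);
        System.out.println(readPrimaryHeader(new ByteArrayInputStream(bytes)));
    }
}
```

A LinkedHashMap keeps the cards in file order, which matches how the header is listed in the issue description.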
[jira] [Commented] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata
[ https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534780#comment-16534780 ]

Tim Allison commented on TIKA-2684:
-----------------------------------

W00t! Thank you [~chrismattmann].

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]

Go Blue(?)!

Need testers in the future?! Ha, we always need testers... in the past, present and future!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2684) Tika does not extract *.fits header text, just file level metadata
[ https://issues.apache.org/jira/browse/TIKA-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534780#comment-16534780 ]

Tim Allison edited comment on TIKA-2684 at 7/6/18 12:46 PM:
------------------------------------------------------------

W00t! Thank you [~chrismattmann].

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]

Go Blue!

Need testers in the future?! Ha, we always need testers... in the past, present and future!

was (Author: talli...@mitre.org):
W00t! Thank you [~chrismattmann].

[~sborda], I updated our wiki: [https://wiki.apache.org/tika/TikaGDAL]

Go Blue(?)!

Need testers in the future?! Ha, we always need testers... in the past, present and future!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)