[
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608804#comment-14608804
]
Tim Allison edited comment on TIKA-1602 at 6/30/15 6:26 PM:
------------------------------------------------------------
One file out of 116,960 text/plain files was misidentified as rfc822 in govdocs1.
No other diffs found.
YMMV.
What's odd (to me) is that the RFC822 parser parsed lots and lots of empty
embedded documents, none of which had any text:
{noformat}
{
  "Content-Type": "application/zip",
  "X-Parsed-By": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.pkg.PackageParser"
  ],
  "X-TIKA:content": "\n036491.txt\n\n",
  "X-TIKA:digest:MD5": "e7cf541cbd061b63c03035ec692b86c9",
  "X-TIKA:digest:SHA256": "96b29ca0c2206feafd6115d993c1fb20ead631381f048442c87870934fb2cd8e",
  "X-TIKA:parse_time_millis": "140"
},
{
  "Content-Encoding": "US-ASCII",
  "Content-Type": "text/plain; charset\u003dUS-ASCII",
  "X-Parsed-By": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.txt.TXTParser"
  ],
  "X-TIKA:digest:MD5": "4ef8164712f6491c2848e861336987b5",
  "X-TIKA:digest:SHA256": "9c8d0d8dc8633ab1a8324bcd19679616729360171fde33812b12c335938f45dc",
  "X-TIKA:embedded_resource_path": "embedded-1/036491.txt/embedded-2/embedded-3/embedded-4/embedded-5/embedded-6/embedded-7/embedded-8/embedded-9/embedded-10/embedded-11/embedded-12/embedded-13/embedded-14/embedded-15/embedded-16/embedded-17/embedded-18/embedded-19/embedded-20/embedded-21/embedded-22/embedded-23/embedded-24/embedded-25/embedded-26/embedded-27/embedded-28/embedded-29/embedded-30/embedded-31/embedded-32/embedded-33/embedded-34/embedded-35/embedded-36/embedded-37/embedded-38/embedded-39/embedded-40/embedded-41/embedded-42/embedded-43/embedded-44/embedded-45/embedded-46/embedded-47/embedded-48/embedded-49/embedded-50/embedded-51/embedded-52/embedded-53/embedded-54"
},
{noformat}
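For anyone who wants to spot records like the second one in their own output, here is a rough Python sketch. It assumes the metadata has been loaded as a list of dicts keyed like the excerpt above, and that each "/"-separated segment of X-TIKA:embedded_resource_path corresponds to one level of nesting. The function name summarize_embedded is made up for illustration and is not part of Tika.

```python
def summarize_embedded(records):
    """Return a (depth, has_text) pair for each metadata record.

    depth    -- number of non-empty "/"-separated segments in
                X-TIKA:embedded_resource_path (0 for the container itself);
                treating one segment as one nesting level is an assumption
    has_text -- True if X-TIKA:content holds anything besides whitespace
    """
    summary = []
    for rec in records:
        path = rec.get("X-TIKA:embedded_resource_path", "")
        depth = len([seg for seg in path.split("/") if seg])
        has_text = bool(rec.get("X-TIKA:content", "").strip())
        summary.append((depth, has_text))
    return summary


# Two records shaped like the excerpt above (most keys omitted for brevity).
deep_path = "embedded-1/036491.txt/" + "/".join(
    f"embedded-{i}" for i in range(2, 55)
)
records = [
    {"Content-Type": "application/zip",
     "X-TIKA:content": "\n036491.txt\n\n"},
    {"Content-Type": "text/plain; charset=US-ASCII",
     "X-TIKA:embedded_resource_path": deep_path},
]

for depth, has_text in summarize_embedded(records):
    print(depth, has_text)
```

Records with a large depth and no text are the ones worth a closer look.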
> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
> Key: TIKA-1602
> URL: https://issues.apache.org/jira/browse/TIKA-1602
> Project: Tika
> Issue Type: New Feature
> Reporter: Jeremy B. Merrill
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: 036491.txt.zip
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Tika does not properly detect certain emails as `message/rfc822` if they're
> slightly standards-non-compliant and begin with `Status: ` as the first
> header. I've added `Status: ` as a magic detection line in
> tika-mimetypes.xml.
> This solves my problem and does not appear to cause unit test failures. I
> have not yet run the tika-batch tests.
> As further information, the emails that are processed incorrectly come from
> dumps taken directly from various US public officials' mailservers. I believe
> the dumps are sometimes slightly non-compliant because they're not intended
> to be transmitted over the wire.
> It's important to note that Tika (and the underlying library, James Mime4J)
> do properly *parse* these emails, despite the non-compliant header. The
> problem is getting Tika to *detect* the file as an email so that Mime4J gets
> chosen to parse it.
> Pull request on GitHub at https://github.com/apache/tika/pull/40
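
To make the detect-versus-parse distinction in the description concrete, here is a minimal Python sketch of prefix-based ("magic") detection. It is an illustration only, not Tika's detector or the contents of tika-mimetypes.xml: the prefix list and the names EMAIL_MAGIC_PREFIXES and looks_like_rfc822 are assumptions, with only `Status: ` taken from the description.

```python
# Hypothetical header prefixes; only b"Status: " comes from the issue text.
EMAIL_MAGIC_PREFIXES = (b"Received: ", b"From: ", b"Status: ")


def looks_like_rfc822(head: bytes) -> bool:
    """Return True if the first bytes of a file match a known header prefix.

    This only inspects the start of the data; it never parses the message.
    """
    return head.startswith(EMAIL_MAGIC_PREFIXES)


# A dump that begins with a non-standard first header is now detected.
print(looks_like_rfc822(b"Status: RO\r\nFrom: someone@example.com\r\n"))

# A plain-text file that happens to start with the same bytes matches too.
print(looks_like_rfc822(b"Status: all systems nominal\n"))

# Text without a matching prefix is left alone.
print(looks_like_rfc822(b"Quarterly report\n"))
```

The second call shows the trade-off of a short magic string: any text file starting with the same bytes will match, which is the kind of false positive a regression run over a large corpus is meant to surface.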