[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-879:
-----------------------------------------
Attachment: mbox_email_section.txt

As described in TIKA-2042, the attached file [^mbox_email_section.txt] contains 
a section of an MBOX file, itself containing a message stream which is detected 
as text/html instead of message/rfc822, even though the correct mimetype is set 
on the Metadata object by the MBOXParser.

> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>
>                 Key: TIKA-879
>                 URL: https://issues.apache.org/jira/browse/TIKA-879
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.0, 1.1, 1.2
>         Environment: linux 3.2.9
>                      oracle jdk7, openjdk7, sun jdk6
>            Reporter: Konstantin Gribov
>              Labels: new-parser
>         Attachments: mbox_email_section.txt, mime_diffs_A_to_B.html, 
>                      TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files is 
> different from what is expected (you can test it on {{testRFC822}} and 
> {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or 
> put some {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.
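
The effect described in the issue can be illustrated with a small self-contained model. This is only a sketch, assuming (as the description suggests) that a type hint from {{CONTENT_TYPE}} or the file name is accepted only when it is a specialization of the type found by magic. The names {{HintModel}}, {{addSubClass}} and {{detect}} are hypothetical and are not Tika's API or Tika's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Tika code): a hint replaces the magic result only if the
// hinted type is registered as a descendant of the magic-detected type.
final class HintModel {
    // child type -> parent type; a stand-in for a media type registry
    private final Map<String, String> parents = new HashMap<>();

    void addSubClass(String child, String parent) {
        parents.put(child, parent);
    }

    // True if 'type' has 'ancestor' somewhere in its parent chain.
    boolean isSpecializationOf(String type, String ancestor) {
        String p = parents.get(type);
        while (p != null) {
            if (p.equals(ancestor)) {
                return true;
            }
            p = parents.get(p);
        }
        return false;
    }

    // Prefer the hint only when it specializes the magic result.
    String detect(String magicType, String hintType) {
        if (hintType != null && isSpecializationOf(hintType, magicType)) {
            return hintType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        HintModel m = new HintModel();
        // No parent registered: the message/rfc822 hint is ignored.
        System.out.println(m.detect("text/plain", "message/rfc822")); // text/plain
        // With a sub-class relation registered, the hint wins.
        m.addSubClass("message/rfc822", "text/plain");
        System.out.println(m.detect("text/plain", "message/rfc822")); // message/rfc822
    }
}
```

In this model, when magic falls back to text/plain, the hint is dropped unless the sub-class relation exists, which matches the behavior reported above.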



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-879:
---------------------------------
Labels: new-parser  (was: )

> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>
>                 Key: TIKA-879
>                 URL: https://issues.apache.org/jira/browse/TIKA-879
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.0, 1.1, 1.2
>         Environment: linux 3.2.9
>                      oracle jdk7, openjdk7, sun jdk6
>            Reporter: Konstantin Gribov
>              Labels: new-parser
>         Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files is 
> different from what is expected (you can test it on {{testRFC822}} and 
> {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or 
> put some {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.





[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2014-12-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated TIKA-879:
---------------------------------
Attachment: TIKA-879-thunderbird.eml

I've run into the same problem with an .eml file written by Thunderbird (see 
attachment).

RFC822 states (http://tools.ietf.org/html/rfc822#section-4.1) that header 
fields can appear in any order:
{quote}
Note: Due to an artifact of the notational conventions, the syntax indicates 
that, when present, some fields, must be in a particular order.  Header fields 
are NOT required to occur in any particular order, except that the message body 
must occur AFTER the headers.
{quote}
If one of the fields that RFC822 treats as optional, especially an 
extension-field (X-...) or any user-defined-field, is the first field in the 
header, the mime magic does not match.

Adding {{<sub-class-of type="text/plain"/>}} would solve the problem only 
partially: if any text file is named *.eml, it is always recognized as 
message/rfc822 regardless of its content. Is the file name/extension a 
strong indicator?

Or would it be possible to relax the MIME magic and allow additional header 
fields at the beginning?
* check for the {{field: value}} structure first
* then check for (some) required fields (Date:, From:), even when they are 
not at the very beginning
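
The two steps proposed above could look roughly like the following self-contained sketch. It is not Tika's magic implementation: the class {{RelaxedRfc822Sniffer}} and the method {{looksLikeRfc822}} are hypothetical names, the choice of Date: and From: as required fields simply follows the suggestion above, and the check assumes the header block is available as a string and is not truncated in the middle of a line.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical relaxed check: every header line must look like "field: value",
// and Date: and From: must appear somewhere before the first blank line.
final class RelaxedRfc822Sniffer {
    // RFC 822 field-name: printable ASCII except space and ':', then a colon.
    private static final Pattern FIELD =
            Pattern.compile("[\\x21-\\x39\\x3B-\\x7E]+[ \\t]*:.*");

    static boolean looksLikeRfc822(String head) {
        Set<String> names = new HashSet<>();
        boolean sawField = false;
        for (String line : head.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                if (!sawField) {
                    return false; // continuation line with nothing to continue
                }
                continue; // folded header value
            }
            if (!FIELD.matcher(line).matches()) {
                return false; // not in "field: value" form
            }
            String name = line.substring(0, line.indexOf(':')).trim();
            names.add(name.toLowerCase(Locale.ROOT));
            sawField = true;
        }
        return names.contains("date") && names.contains("from");
    }

    public static void main(String[] args) {
        // Made-up sample whose first field is an extension field.
        String eml = "X-Mozilla-Status: 0001\r\n"
                + "Date: Tue, 23 Dec 2014 10:00:00 +0100\r\n"
                + "From: someone@example.org\r\n"
                + "\r\n"
                + "body";
        System.out.println(looksLikeRfc822(eml));                        // true
        System.out.println(looksLikeRfc822("Just some text\nno headers")); // false
    }
}
```

A check of this shape accepts extension fields in the first position, but it would still reject an mbox separator line ("From " without a colon), so mbox sections would need separate handling.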


> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>
>                 Key: TIKA-879
>                 URL: https://issues.apache.org/jira/browse/TIKA-879
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.0, 1.1, 1.2
>         Environment: linux 3.2.9
>                      oracle jdk7, openjdk7, sun jdk6
>            Reporter: Konstantin Gribov
>         Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files is 
> different from what is expected (you can test it on {{testRFC822}} and 
> {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or 
> put some {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.


