[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Caruana Galizia updated TIKA-879:
-----------------------------------------
    Attachment: mbox_email_section.txt

As described in TIKA-2042, the attached file [^mbox_email_section.txt] contains a section of an MBOX file, itself containing a message stream which is detected as text/html instead of message/rfc822, even though the correct mimetype is set on the Metadata object by the MBOXParser.

> Detection problem: message/rfc822 file is detected as text/plain.
> ------------------------------------------------------------------
>
>                 Key: TIKA-879
>                 URL: https://issues.apache.org/jira/browse/TIKA-879
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.0, 1.1, 1.2
>         Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>            Reporter: Konstantin Gribov
>              Labels: new-parser
>         Attachments: mbox_email_section.txt, mime_diffs_A_to_B.html, TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files is different
> (you can test it on {{testRFC822}} and {{testRFC822_base64}} in
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for
> such files, even if you set {{CONTENT_TYPE}} in the metadata or some {{.eml}}
> file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain")}}
> returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
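
To illustrate the behavior the reporter describes, here is a minimal, self-contained sketch of how a hint (a content type or file name from the metadata) can be dropped when it is not registered as a specialization of the magic result. This is a toy model written for this thread, not Tika's actual code: the class {{HintModel}} and the methods {{addSubClass}} and {{detect}} are hypothetical, and only the name {{isSpecializationOf}} is borrowed from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of hint handling during detection. Hypothetical; not Tika's implementation. */
final class HintModel {
    // Maps a child type to its declared parent type.
    private final Map<String, String> parents = new HashMap<>();

    /** Declares that child is a sub-class of parent. */
    void addSubClass(String child, String parent) {
        parents.put(child, parent);
    }

    /** Returns true if type reaches ancestor by following declared parents. */
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    /** Assumed rule: the hint replaces the magic result only if it specializes it. */
    String detect(String magicType, String hintType) {
        if (hintType != null && isSpecializationOf(hintType, magicType)) {
            return hintType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        HintModel registry = new HintModel();
        registry.addSubClass("text/plain", "application/octet-stream");

        // message/rfc822 has no declared parent, so the hint is ignored.
        System.out.println(registry.detect("text/plain", "message/rfc822")); // prints text/plain

        // Once the relationship is declared, the same hint is accepted.
        registry.addSubClass("message/rfc822", "text/plain");
        System.out.println(registry.detect("text/plain", "message/rfc822")); // prints message/rfc822
    }
}
```

Under this assumed rule, a file whose magic match is text/plain keeps that type however the metadata is set, which matches the symptom reported above.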
[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-879:
---------------------------------
    Labels: new-parser  (was: )
[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated TIKA-879:
---------------------------------
    Attachment: TIKA-879-thunderbird.eml

I've run into the same problem with an .eml file written by Thunderbird (see attachment).

RFC822 states (http://tools.ietf.org/html/rfc822#section-4.1) that header fields can appear in any order:

{quote}
Note: Due to an artifact of the notational conventions, the syntax indicates that, when present, some fields, must be in a particular order. Header fields are NOT required to occur in any particular order, except that the message body must occur AFTER the headers.
{quote}

If one of the optional fields (according to RFC822), esp. an extension-field (X-...) or any user-defined-field, is the first field in the header, the mime magic does not work.

Adding {{<sub-class-of type="text/plain"/>}} would solve the problem only partially: if any text file is named *.eml, it is always recognized as message/rfc822 regardless of its content. Is the file name/extension a strong indicator?

Or would it be possible to relax the MIME magic and allow additional header fields at the beginning?
* check for the {{field: value}} structure first
* then check for (some) required fields (Date:, From:), even if they are not immediately at the beginning
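
The two-step check suggested in Sebastian Nagel's comment (first the {{field: value}} structure, then the presence of Date: and From: anywhere in the header block) could be sketched roughly as follows. This is only an illustration of the idea, not a patch for Tika: the class name {{RelaxedRfc822Check}} and the method {{looksLikeRfc822}} are hypothetical, and treating both Date: and From: as mandatory is an assumption taken from the comment.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a relaxed header check; not part of Tika. */
final class RelaxedRfc822Check {
    // A field name is printable ASCII without space or colon (RFC 822, section 3.2),
    // followed by a colon. Optional whitespace before the colon is tolerated here.
    private static final Pattern FIELD =
            Pattern.compile("^([\\x21-\\x39\\x3B-\\x7E]+)[ \\t]*:.*$");

    static boolean looksLikeRfc822(String text) {
        Set<String> seen = new HashSet<>();
        boolean sawField = false;
        for (String line : text.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                if (!sawField) {
                    return false; // continuation line with nothing to continue
                }
                continue; // folded header line
            }
            Matcher m = FIELD.matcher(line);
            if (!m.matches()) {
                return false; // step 1 failed: not "field: value"
            }
            sawField = true;
            seen.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        // Step 2: required fields may appear anywhere in the header block.
        return seen.contains("date") && seen.contains("from");
    }

    public static void main(String[] args) {
        // Made-up sample whose first fields are extension fields.
        String sample = "X-Mozilla-Status: 0001\r\n"
                + "Date: Mon, 02 Apr 2012 10:00:00 +0200\r\n"
                + "From: a@example.org\r\n"
                + "\r\n"
                + "body";
        System.out.println(looksLikeRfc822(sample)); // prints true
    }
}
```

Unlike a rule based on the file name alone, a plain text file renamed to *.eml fails this check unless it happens to start with well-formed header lines that include both fields.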