[
https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125545#comment-17125545
]
Nick Burch commented on TIKA-3106:
----------------------------------
This email starts with a series of long {{ARC-}} headers, which means that the
"normal" email headers occur much later in the file than is typical.
I've added a match for this in {{1e02f0181}}, which allows an ARC signature
header first to be matched, as we already did for a DKIM header first. With that
commit in place, your file is detected from its contents alone.
Can you please give a nightly build, or an overriding Tika mime types file, a
try with your files, and see if any other first email headers are still being
missed for detection?
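
To illustrate the idea of matching on the first header, here is a minimal
standalone sketch. It is not Tika's implementation (the real rules are magic
patterns in the Tika mime types file), and the class name, method name and
header list are assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika: guesses whether content starts
// like an RFC 822 message by looking only at the very first header name.
final class FirstHeaderSniffer {

    // Illustrative list only; it is NOT copied from Tika's mime types file.
    private static final String[] FIRST_HEADERS = {
        "received:", "message-id:", "return-path:", "delivered-to:",
        "dkim-signature:",
        // ARC header names as defined in RFC 8617
        "arc-seal:", "arc-message-signature:", "arc-authentication-results:"
    };

    static boolean looksLikeEmail(byte[] head) {
        // Header names are short, so the first 64 bytes are enough here.
        int len = Math.min(head.length, 64);
        String start = new String(head, 0, len, StandardCharsets.US_ASCII)
                .toLowerCase(Locale.ROOT);
        for (String header : FIRST_HEADERS) {
            if (start.startsWith(header)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] arcFirst = "ARC-Seal: i=1; a=rsa-sha256\r\nFrom: a@example.org\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "Hello, this is just a note.\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeEmail(arcFirst)); // prints "true"
        System.out.println(looksLikeEmail(plain));    // prints "false"
    }
}
```

A file whose first header is not in the list falls through to text/plain under
this kind of check, which is the failure mode described above for ARC headers
before the new match was added.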
> Tika Fails to detect some EML files if extension is not .eml
> ------------------------------------------------------------
>
> Key: TIKA-3106
> URL: https://issues.apache.org/jira/browse/TIKA-3106
> Project: Tika
> Issue Type: Bug
> Components: metadata, mime
> Affects Versions: 1.24
> Reporter: Xiaohong Yang
> Priority: Critical
> Attachments: EmlFile.txt
>
>
> I have an eml file that is detected as message/rfc822 only if the file
> extension is .eml; otherwise it is detected as text/plain. The following
> is the code that I use to detect the file type and extension.
> TikaConfig config = TikaConfigFactory.getTikaConfig();
> Detector detector = config.getDetector();
> Metadata metadata = new Metadata();
> TikaInputStream stream = TikaInputStream.get(fis = new FileInputStream(filePath));
> metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
> MediaType mediaType = detector.detect(stream, metadata);
> MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
> String tikaExtension = mimeType.getExtension();
>
> When the sample file has the .eml extension, mimeType is message/rfc822 and
> tikaExtension is .eml. When I change the extension to .txt, mimeType is
> text/plain and tikaExtension is .txt.
>
> The same mimeType and tikaExtension should be detected regardless of the
> file extension.
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)