[ https://issues.apache.org/jira/browse/TIKA-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527822#comment-17527822 ]
Ross Johnson commented on TIKA-3732:
------------------------------------
I took a quick look at the attached file in a hex editor and can confirm that
it is indeed an RTF file despite the file extension being .DOC. It appears that
Tika is detecting the type correctly.
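
For anyone who wants to repeat the check without a hex editor, here is a
minimal sketch in plain Java. It only compares the first bytes of a file
against two well-known signatures: RTF documents begin with the ASCII
text {\rtf, while binary Word documents use the OLE2 container header
D0 CF 11 E0 A1 B1 1A E1. The class name MagicSniffer and its methods are
made up for illustration and are not part of Tika. Tika's own detection
is far more thorough, so treat this as a sanity check only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not a Tika class.
class MagicSniffer {
    // RTF files start with the ASCII characters {\rtf
    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound file header, used by binary .doc/.xls/.ppt files
    private static final byte[] OLE2_MAGIC = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns "rtf", "ole2", or "unknown" based only on the leading bytes.
    static String sniff(byte[] header) {
        if (startsWith(header, RTF_MAGIC)) {
            return "rtf";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MagicSniffer <file>");
            return;
        }
        byte[] header;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // readNBytes(int) requires Java 11 or later
            header = in.readNBytes(OLE2_MAGIC.length);
        }
        System.out.println(args[0] + ": " + sniff(header));
    }
}
```

Running it against a file that reports "rtf" while carrying a .DOC
extension would be consistent with what the hex editor shows here: the
extension says Word, but the content is RTF.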
> Word doc MediaType detected as RTF
> ----------------------------------
>
> Key: TIKA-3732
> URL: https://issues.apache.org/jira/browse/TIKA-3732
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.2.1
> Reporter: Caleb Postlethwait
> Priority: Major
> Attachments: example.DOC
>
>
> When executing Detector.detect(InputStream input, Metadata metadata) on a
> particular Word document, we're getting back a MediaType of RTF which has
> some downstream effects for us.
> Here's the relevant bit of code:
> TikaConfig config = TikaConfigFactory.getTikaConfig();
> Detector detector = config.getDetector();
> Metadata metadata = new Metadata();
> stream = TikaInputStream.get(fis = new FileInputStream(paths));
> metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, paths);
> MediaType mediaType = detector.detect(stream, metadata);
> Attaching the file that we came across this issue on.