[ 
https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154110#comment-13154110
 ] 

Nick Burch commented on TIKA-786:
---------------------------------

The problem seems to be with how DefaultDetector handles conflicting detection 
results, which differs from how the previous ContainerAwareDetector handled them.

Previously, the logic was to ask the container detectors to review the file. If 
they had a good match, that was used as the mimetype. Only if the container 
ones didn't know would the mime magic+filename detection (provided by 
MimeTypes) be used.
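
As a rough sketch of that old ordering (this is not Tika's code: the class, the 
method names and the "contentKind" parameter are all made up for illustration, 
and the real detectors read the stream rather than taking a hint string):

```java
import java.util.Optional;

public class ContainerFirstDetection {

    // Hypothetical stand-in for a container detector: it answers only
    // when the file's internal structure gives it a confident match.
    public static Optional<String> detectByContainer(String contentKind) {
        if ("word".equals(contentKind)) {
            return Optional.of("application/msword");
        }
        if ("excel".equals(contentKind)) {
            return Optional.of("application/vnd.ms-excel");
        }
        return Optional.empty(); // not a container this detector understands
    }

    // Hypothetical stand-in for the mime magic+filename detection;
    // only the file name part is modelled here.
    public static String detectByMagicAndName(String fileName) {
        if (fileName.endsWith(".xls")) return "application/vnd.ms-excel";
        if (fileName.endsWith(".doc")) return "application/msword";
        if (fileName.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }

    // Old ordering: a confident container match wins, and the fallback
    // is consulted only when the container detectors do not know.
    public static String detect(String fileName, String contentKind) {
        return detectByContainer(contentKind)
                .orElseGet(() -> detectByMagicAndName(fileName));
    }

    public static void main(String[] args) {
        // A Word document renamed to .xls is still reported as Word.
        System.out.println(detect("report.xls", "word"));
        // A plain text file falls through to the name-based fallback.
        System.out.println(detect("notes.txt", "unknown"));
    }
}
```

With this ordering the file extension only matters when nothing better is known.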

Under the new DefaultDetector system, this has changed. Instead, each detector 
is tried in turn, and while detectors are allowed to specialise a previously 
detected type, they are not permitted to change it completely (even if a 
previous detector was wrong).
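
To illustrate why that rule loses the right answer, here is a small 
self-contained sketch. It is an assumption-laden model, not the DefaultDetector 
source: the class name, the PARENT map and the idea of passing in each 
detector's answer as a string are all hypothetical, and the type hierarchy 
shown is only illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecialiseOnlyDetection {

    public static final String GENERIC = "application/octet-stream";
    public static final String OLE2 = "application/x-tika-msoffice";
    public static final String WORD = "application/msword";
    public static final String EXCEL = "application/vnd.ms-excel";

    // Illustrative hierarchy (child type -> parent type), assumed for the sketch.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put(OLE2, GENERIC);
        PARENT.put(WORD, OLE2);
        PARENT.put(EXCEL, OLE2);
    }

    // True if 'candidate' is a strict descendant of 'current' in the hierarchy.
    public static boolean isSpecialisationOf(String candidate, String current) {
        String t = PARENT.get(candidate);
        while (t != null) {
            if (t.equals(current)) return true;
            t = PARENT.get(t);
        }
        return false;
    }

    // Each detector's answer is considered in turn; an answer replaces the
    // current type only when it specialises it, never when it contradicts it.
    public static String detect(List<String> answersInOrder) {
        String type = GENERIC;
        for (String detected : answersInOrder) {
            if (isSpecialisationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Name-based answer (Excel) comes first; the container answer (Word)
        // is not a specialisation of Excel, so it is ignored.
        System.out.println(detect(Arrays.asList(EXCEL, WORD)));
        // A generic OLE2 answer can still be specialised to Word.
        System.out.println(detect(Arrays.asList(OLE2, WORD)));
    }
}
```

The first call models a .doc renamed to .xls: once the wrong type is chosen, 
the later, correct answer has no way to replace it.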

It looks like this DefaultDetector logic will need to be changed to allow 
detectors such as the container ones to override incorrect (typically 
filename-based) detection.
                
> Tika CLI --detect returns incorrect content-type for files with altered 
> extensions
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-786
>                 URL: https://issues.apache.org/jira/browse/TIKA-786
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.1
>         Environment: Windows
>            Reporter: John Mastarone
>            Priority: Minor
>
> From a discussion on the user mailing list on Nov. 11 2011, where the 
> following was requested as a new bug: Tika CLI will return incorrect content 
> type information when called with --detect for files that have had their 
> extensions modified (and nothing else).  MS Word (.doc) documents that have 
> their extension changed to .xls or .ppt will be incorrectly detected as Excel 
> or PowerPoint documents, whereas the --metadata option will determine the 
> content type correctly (as application/msword), based on the actual contents 
> of these mis-named files.  The same also occurs with other types of MS Office 
> 2003 documents, and could possibly occur with a wide range of document types. 
>  To quote Nick B., from the user mailing list: "If you look at the 
> TestMediaTypes class you'll see what you can get with just the mime magic and 
> filenames, and then there's TestContainerAwareDetector which shows the 
> correct detection happening by using the extra detectors available".   


        
