[ 
https://issues.apache.org/jira/browse/TIKA-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448128#comment-17448128
 ] 

Luís Filipe Nassif commented on TIKA-3596:
------------------------------------------

Simply changing the line I referenced broke 4 tests, 2 of them related to 
TIKA-426. I thought about making this magic 
https://github.com/apache/tika/blob/324f2f2ccff21c608969e2e79da88e71379a58dc/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4844
 more restrictive, but then it could miss some XML files.

So I think it is safer to make a narrower change that handles the truncated or 
badly encoded XML samples I have, all similar to the uploaded test file. In 
XmlRootExtractor, those files cause a MalformedByteSequenceException to be 
thrown by Xerces or the JDK 11 SAX parser, so a null root element is returned. 
The JDK 8 SAX parser does not throw and detects those files fine; I hit this 
when upgrading the JDK. Catching that exception and retrying a few times with 
fewer bytes in the data[] array was enough to detect my samples correctly, and 
all tests pass.
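
The catch-and-retry idea can be sketched roughly as below. This is only an 
illustration, not the actual patch: the class name RetryingRootSniffer, the 
sniffRoot method, the MAX_ATTEMPTS limit and the halving strategy are all 
assumptions of mine. It catches java.io.CharConversionException, the public 
JDK superclass of MalformedByteSequenceException, so it does not depend on 
parser-internal classes.

```java
import java.io.ByteArrayInputStream;
import java.io.CharConversionException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not Tika's XmlRootExtractor.
final class RetryingRootSniffer {

    // Assumed retry limit; the comment above only says "some times".
    private static final int MAX_ATTEMPTS = 8;

    /** Records the first element seen, then aborts the parse. */
    private static final class RootHandler extends DefaultHandler {
        String root;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            root = localName.isEmpty() ? qName : localName;
            throw new SAXException("root found, stop parsing");
        }
    }

    /** Returns the root element's local name, or null if none was found. */
    static String sniffRoot(byte[] data) {
        byte[] buf = data;
        for (int i = 0; i < MAX_ATTEMPTS && buf.length > 0; i++) {
            RootHandler handler = new RootHandler();
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                factory.newSAXParser().parse(new ByteArrayInputStream(buf), handler);
                return handler.root;
            } catch (Exception e) {
                if (handler.root != null) {
                    return handler.root; // parse was aborted after the root
                }
                boolean malformed = e instanceof CharConversionException
                        || e.getCause() instanceof CharConversionException;
                if (!malformed) {
                    return null; // some other parse error: not retried
                }
                // Bad byte sequence: retry with half the buffer (assumed strategy).
                buf = Arrays.copyOf(buf, buf.length / 2);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><item>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] sample = Arrays.copyOf(head, head.length + 201);
        Arrays.fill(sample, head.length, sample.length - 1, (byte) 'x');
        sample[sample.length - 1] = (byte) 0xFF; // invalid UTF-8 byte
        System.out.println(sniffRoot(sample));
    }
}
```

Halving is just one way to get "fewer bytes"; it only helps when the root 
element starts well before the bad bytes, which is the case for truncated or 
partly corrupted files like the ones described above.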

> Detect corrupted XML files as application/xml instead of text/plain
> -------------------------------------------------------------------
>
>                 Key: TIKA-3596
>                 URL: https://issues.apache.org/jira/browse/TIKA-3596
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector
>    Affects Versions: 1.27, 2.1.0
>            Reporter: Luís Filipe Nassif
>            Assignee: Luís Filipe Nassif
>            Priority: Minor
>         Attachments: test.xyz
>
>
> There is logic in the MimeTypes class that returns text/plain for corrupted 
> XML files not detected as text/html, here: 
> https://github.com/apache/tika/blob/324f2f2ccff21c608969e2e79da88e71379a58dc/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L281
> I think this should be changed to return application/xml even if the file is 
> corrupted, as is done for all other MIME types, which would be more 
> consistent across file formats. Even if a jpg or doc file is corrupted, 
> image/jpeg or application/msword is returned.
> About 2k of the ~90k XML files in an internal corpus trigger this.
> If other devs agree, I can submit a patch and unit test.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
