[ 
https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337531#comment-16337531
 ] 

Andreas Meier commented on TIKA-2527:
-------------------------------------

I attached a patch to address the mentioned problems.

 

Furthermore, I added three new mime-type sections for application/x-lz4, 
image/x-tga and audio/x-caf.

The image/x-tga section had to be placed before the application/x-123 
mime-type recognition, because the starting bytes overlap in some cases.

The important part of the image/x-tga recognition is the inner match that 
searches for the trailing signature:

54 52 55 45 56 49 53 49   TRUEVISI
4F 4E 2D 58 46 49 4C 45   ON-XFILE
2E 00                     ..
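
For illustration only, the trailing-signature test itself can be sketched in 
plain Java. This is not how tika-mimetypes.xml performs the match and it is 
not part of the attached patch; the class and method names are hypothetical. 
The sketch assumes the whole file is available as a byte array and that the 
18 bytes shown above are the last bytes of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks for the trailing TGA signature.
final class TgaFooterCheck {

    // "TRUEVISION-XFILE" followed by '.' (0x2E) and a NUL byte: 18 bytes in total.
    private static final byte[] SIGNATURE =
            "TRUEVISION-XFILE.\0".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the last 18 bytes of data equal the signature.
    static boolean hasTrailingSignature(byte[] data) {
        if (data == null || data.length < SIGNATURE.length) {
            return false;
        }
        int from = data.length - SIGNATURE.length;
        byte[] tail = Arrays.copyOfRange(data, from, data.length);
        return Arrays.equals(tail, SIGNATURE);
    }
}
```

Comparing a fixed-length tail like this needs no regex, but it does require 
knowing where the file ends, which a match anchored at the start of the 
stream does not have.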

 

Is there an easier way to search for trailing magic strings than using 
a regex?

I thought that a regex on its own might be too expensive for recognizing 
image/x-tga, so I combined it with the basic TGA recognition from the 
Linux magic file.

 

While testing tika-mimetypes.xml, I often assumed that my match was already 
correct when the detection had actually been done by the file extension. 
I therefore had to remove the file extensions from my test files to validate 
the matches.

To avoid this, I suggest either creating a test case that checks only the 
matches, without taking file extensions into account, or removing the 
file extensions from the test files so that the matches are validated.

Is there a test case that does this already?
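
As a rough sketch of the second option, a test helper could copy each test 
file to a temporary file whose name has no extension before handing it to 
the detector. The helper below uses only the Java standard library; the class 
and method names are hypothetical, and the detector call is deliberately left 
out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical test helper, not part of Tika.
final class ExtensionlessCopy {

    // Copies source to a temp file whose name carries no file extension,
    // so that only the content (magic bytes) is left to identify it.
    static Path copyWithoutExtension(Path source) throws IOException {
        // An empty suffix keeps createTempFile from appending the default ".tmp".
        Path target = Files.createTempFile("magic-only-", "");
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

A test would then run detection on the returned path and compare the result 
with the expected mime type, so that a wrong or missing magic match cannot be 
hidden by a glob pattern.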

 

If you have any questions or suggestions I would be glad to hear them.

> Typos in tika-mimetypes.xml
> ---------------------------
>
>                 Key: TIKA-2527
>                 URL: https://issues.apache.org/jira/browse/TIKA-2527
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.0, 1.16, 1.17, 1.18
>         Environment: ALL
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: 
> fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch
>
>
> Are these mimetypes in tika-mimetypes.xml
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} <mime-type type="audio/x-adbcm">{code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} <mime-type type="audio/x-dec-adbcm">{code}
> intended?
> I couldn't find these mimetypes.
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
