[
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730649#comment-17730649
]
Nick Burch commented on TIKA-4060:
----------------------------------
0x494433 is the string "ID3", which I think ought to be at the start of the
file. It is in the handful of files I've found. The rest of the magic is pretty
vague and a little prone to false positives, so I'm reluctant to match on the
string "ID3" anywhere in the first 2 KB and then the vague 3 bytes somewhere
else further on. I've tried to make the matches a little "tighter" to reduce
false positives, but I seem to have gone too tight. The test file I produced
with ID3 tags does have "ID3" at the start. The key sections of its hex dump are:
{{00000000 49 44 33 03 00 00 00 00 09 6b 54 50 45 31 00 00 |ID3......kTPE1..|}}
{{00000010 00 0c 00 00 00 54 65 73 74 20 41 72 74 69 73 74 |.....Test Artist|}}
{{...}}
{{00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|}}
{{*}}
{{000004f0 00 00 00 00 00 ff f1 50 80 32 5f fc de 02 00 4c |.......P.2_....L|}}
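For what it's worth, the dump above lines up with the ID3v2 header layout: bytes 6-9 (00 00 09 6b) are the tag size as a 4-byte synchsafe integer, and adding the 10-byte header gives 0x4f5, exactly where the ff f1 sits. A rough Python sketch of that check follows. It assumes a plain 10-byte ID3v2 header with no footer, the function names are made up for illustration, and it is not something Tika's XML magic can express directly:

```python
def id3v2_tag_length(header: bytes) -> int:
    """Return header + body length of a leading ID3v2 tag, or 0 if absent.

    Assumes the usual 10-byte header: "ID3", 2 version bytes, 1 flags byte,
    then a 4-byte synchsafe size (7 useful bits per byte). An optional
    ID3v2.4 footer is ignored in this sketch.
    """
    if len(header) < 10 or header[:3] != b"ID3":
        return 0
    size = 0
    for b in header[6:10]:
        if b & 0x80:
            return 0  # high bit set: not a valid synchsafe byte
        size = (size << 7) | b
    return 10 + size


def looks_like_adts_after_id3(data: bytes) -> bool:
    """True if an ADTS sync (FF then F0/F1/F8/F9) follows the ID3v2 tag."""
    offset = id3v2_tag_length(data[:10])
    if offset == 0 or len(data) < offset + 2:
        return False
    return data[offset] == 0xFF and data[offset + 1] in (0xF0, 0xF1, 0xF8, 0xF9)


# First ten bytes of the hex dump above.
sample_header = bytes.fromhex("49 44 33 03 00 00 00 00 09 6b")
# 1269 == 0x4f5, the offset of "ff f1" in the dump.
adts_offset = id3v2_tag_length(sample_header)
```

This only checks the two sync bytes after the tag, so it is looser than the PRONOM value on the following bytes but much stricter about where they may appear.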
> Add magic to audio/aac in tika-mimetypes.xml
> --------------------------------------------
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
> Issue Type: Sub-task
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments:
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef,
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file
> extension. PRONOM recently added support for identifying AAC files, but the
> signature is tricky. There are two signatures, shown below in PRONOM format.
> Curly braces mean to look ahead between the two values for the subsequent
> patterns.
>
> The first pattern is pretty basic. The second pattern is the first pattern
> after an ID3 header of up to 2048 bytes.
>
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
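
To make the two quoted signatures easier to experiment with, they can be transcribed into byte regexes. The sketch below uses Python's re module as a stand-in for PRONOM's matcher, which is an assumption of this example and not how DROID or Tika evaluates them. The names SIG1, SIG2 and byte_class are hypothetical. The duplicated 60 in the last group is listed once, which does not change the set of accepted bytes.

```python
import re

# Alternatives for the 2nd, 3rd and 4th bytes, copied from the PRONOM value.
SECOND = "F0 F1 F8 F9"
THIRD = (
    "40 41 44 45 48 49 4C 4D 50 51 54 55 58 59 5C 5D "
    "60 61 64 65 68 69 6C 6D 70 71 "
    "80 81 84 85 88 89 8C 8D 90 91 94 95 98 99 9C 9D "
    "A0 A1 A4 A5 A8 A9 AC AD B0 B1"
)
FOURTH = "00 01 20 40 41 60 80 81 A0 C0 C1 E0"


def byte_class(hex_values: str) -> bytes:
    """Build a regex character class like [\\xf0\\xf1...] from hex values."""
    escapes = b"".join(b"\\x%02x" % b for b in bytes.fromhex(hex_values))
    return b"[" + escapes + b"]"


ADTS_FRAME = b"\\xff" + byte_class(SECOND) + byte_class(THIRD) + byte_class(FOURTH)

# sig.1: the frame header at offset 0 (use .match(), which anchors at BOF).
SIG1 = re.compile(ADTS_FRAME)

# sig.2: "ID3" at offset 0, a gap of 0 to 2045 arbitrary bytes, then the frame.
SIG2 = re.compile(b"ID3" + b".{0,2045}?" + ADTS_FRAME, re.DOTALL)
```

Calling SIG2.match(data) on the first few KB of a file mirrors the {0-2045} look-ahead. Because the gap accepts any bytes, this form has the same false-positive exposure discussed in the comment.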