[
https://issues.apache.org/jira/browse/TIKA-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050868#comment-18050868
]
Tim Allison edited comment on TIKA-4612 at 1/9/26 3:13 PM:
-----------------------------------------------------------
And of course, that conversation doesn't include all the back and forth we had
to do afterward to balance some other items that hadn't been considered,
including:
{noformat}
Yes, I'm surprised - and I was wrong in my earlier analysis. I said there was
"no overlap" between MP3 and AAC patterns because I focused on the frame sync
bytes:
- MP3: 0xFFF2-0xFFFF, 0xFFE3
- AAC: 0xFFF0, 0xFFF1, 0xFFF8, 0xFFF9

Those don't overlap. But I missed that MP3's magic also includes a simple ID3
match at offset 0 (line 5951):

<match value="ID3" type="string" offset="0"/>

So an AAC file with ID3 tags matches:
1. MP3's ID3 pattern at offset 0 (priority 50 after bump)
2. AAC's two-frame-sync pattern at offset 512:8192 (priority 40)

MP3 wins because 50 > 40. My analysis was incomplete - I should have checked
all the match patterns for MP3, not just the frame sync bytes. The "belt and
suspenders" recommendation was based on faulty reasoning.
{noformat}
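For anyone following along, the priority tie-break described in that excerpt can be sketched roughly as below. This is not Tika's actual detection code: MagicRule, detect and sampleAacWithId3 are hypothetical names made up for illustration, and the AAC rule is simplified to a single 0xFFF1 sync word rather than the real two-frame-sync pattern. Only the priorities (50 and 40) and the offsets (0 and 512:8192) are taken from the text above.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MagicPrioritySketch {

    /** Hypothetical, simplified magic rule: a byte pattern searched at offsets minOffset..maxOffset. */
    static final class MagicRule {
        final String mimeType;
        final int priority;
        final byte[] pattern;
        final int minOffset;
        final int maxOffset;

        MagicRule(String mimeType, int priority, byte[] pattern, int minOffset, int maxOffset) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.pattern = pattern;
            this.minOffset = minOffset;
            this.maxOffset = maxOffset;
        }

        /** True if the pattern occurs at any start offset in the allowed range. */
        boolean matches(byte[] data) {
            for (int off = minOffset; off <= maxOffset && off + pattern.length <= data.length; off++) {
                int i = 0;
                while (i < pattern.length && data[off + i] == pattern[i]) {
                    i++;
                }
                if (i == pattern.length) {
                    return true;
                }
            }
            return false;
        }
    }

    // Priorities and offsets as quoted in the comment; the AAC pattern is a simplification.
    static final MagicRule MP3_ID3 =
            new MagicRule("audio/mpeg", 50, "ID3".getBytes(StandardCharsets.US_ASCII), 0, 0);
    static final MagicRule AAC_SYNC =
            new MagicRule("audio/x-aac", 40, new byte[] {(byte) 0xFF, (byte) 0xF1}, 512, 8192);

    /** Returns the type of the highest-priority matching rule, or a generic fallback. */
    static String detect(byte[] data, List<MagicRule> rules) {
        MagicRule best = null;
        for (MagicRule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.mimeType;
    }

    /** Synthetic buffer: an ID3 tag at offset 0 and an AAC-style sync word at offset 600. */
    static byte[] sampleAacWithId3() {
        byte[] data = new byte[1024];
        data[0] = 'I';
        data[1] = 'D';
        data[2] = '3';
        data[600] = (byte) 0xFF;
        data[601] = (byte) 0xF1;
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleAacWithId3();
        // Both rules match; the ID3 rule wins because 50 > 40.
        System.out.println(detect(data, List.of(MP3_ID3, AAC_SYNC))); // audio/mpeg
        // Without the ID3 rule, the AAC rule is the only match.
        System.out.println(detect(data, List.of(AAC_SYNC)));          // audio/x-aac
    }
}
```

The point of the sketch is only that a cheap, high-priority match at offset 0 can mask a more specific lower-priority match further into the file, which is why every match pattern for a type has to be checked when adjusting priorities.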
> Some mp3 files are detected as audio/x-aac instead of audio/mpeg
> ----------------------------------------------------------------
>
> Key: TIKA-4612
> URL: https://issues.apache.org/jira/browse/TIKA-4612
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.2.3
> Reporter: V. S.
> Assignee: Tim Allison
> Priority: Major
> Attachments: mp3-v-aac-claude.txt, test.mp3
>
>
> When the attached test.mp3 file is passed to Tika.detect, _all versions since
> Tika 2.9.0_ incorrectly report "audio/x-aac" instead of "audio/mpeg". Tika
> 2.8.0 correctly reports "audio/mpeg".
> I believe this might be due to the priority setting here, but I do not fully
> understand how it works:
> [https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6166|https://github.com/apache/tika/blob/3.2.3/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6166]
> Note that I can only supply the first 1024 bytes of the MP3 file for legal
> reasons. However, this seems to be enough for the detection logic.
> This error has occurred with about 30% of the MP3 files we were processing.
>
> Other tools correctly report MP3, e.g.
> {{$ file test.mp3 }}
> {{test.mp3: Audio file with ID3 version 2.3.0, contains:\012- MPEG ADTS,
> layer III, v2, 64 kbps, 16 kHz, JntStereo}}
>
> Minimal test program:
> {noformat}
> package com.example;
>
> import org.apache.tika.Tika;
> import java.io.FileInputStream;
> import java.io.IOException;
>
> public class TikaTest {
>     public static void main(String[] args) {
>         Tika tika = new Tika();
>
>         try (FileInputStream fis = new FileInputStream("test.mp3")) {
>             // Print the media type detected from the stream's leading bytes
>             System.out.println(tika.detect(fis));
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
> }
> {noformat}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)