[
https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176070#comment-13176070
]
Nick Burch commented on TIKA-793:
---------------------------------
I've tracked this to two bugs. Both relate to the handling of UTF-16 encoded
strings.
I've fixed the first, a problem in the null-termination stripping, in
r1224865.
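As an illustration of how this kind of symptom can arise (a sketch under
assumptions, not the actual change in r1224865): a UTF-16 terminator is two
zero bytes, but many ASCII-range characters in UTF-16LE also end in a zero
byte. Stripping trailing zeros one byte at a time can therefore cut the last
character in half, and the decoder then emits U+FFFD (65533). The class and
method names below are hypothetical and are not Tika's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not Tika's actual code.
final class Utf16TerminatorSketch {

    // Naive approach: drops every trailing zero byte, one byte at a time.
    // For UTF-16LE this can also eat the high byte of the last character.
    static byte[] stripBytewise(byte[] data) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    // UTF-16-aware approach: drops trailing 0x00 0x00 pairs only.
    // Assumes the input has an even length, i.e. whole 16-bit code units.
    static byte[] stripPairwise(byte[] data) {
        int end = data.length;
        while (end >= 2 && data[end - 1] == 0 && data[end - 2] == 0) {
            end -= 2;
        }
        return Arrays.copyOf(data, end);
    }

    // Decodes as UTF-16LE; malformed input is replaced with U+FFFD.
    static String decodeLe(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }
}
```

For the bytes 4D 00 65 00 00 00 ("Me" plus terminator in UTF-16LE), the
bytewise version leaves 4D 00 65, which decodes to "M" followed by U+FFFD,
while the pairwise version leaves 4D 00 65 00, which decodes to "Me".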
The second is the handling of the COMM (Comment) tag, which contains both a
language and text. We don't currently support the language being encoded
differently from the text; that remains to be fixed (and really needs a test
file too).
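For reference, a minimal sketch of how a COMM body could be split, assuming
the ID3v2.3/2.4 layout: one encoding byte, a three-byte language code that is
always ISO-8859-1, then a terminated short description and the comment text,
both in the declared encoding. This is not Tika's parser; the class and field
names are hypothetical, and error handling is kept to a minimum.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the COMM layout; not Tika's actual parser.
final class CommFrameSketch {
    final String language;
    final String description;
    final String text;

    private CommFrameSketch(String language, String description, String text) {
        this.language = language;
        this.description = description;
        this.text = text;
    }

    // True if a full terminator (1 or 2 zero bytes) sits at the given offset.
    private static boolean isTerminator(byte[] b, int at, int width) {
        for (int i = 0; i < width; i++) {
            if (b[at + i] != 0) {
                return false;
            }
        }
        return true;
    }

    static CommFrameSketch parse(byte[] body) {
        if (body.length < 4) {
            throw new IllegalArgumentException("COMM body too short");
        }
        Charset cs;
        int width; // terminator size in bytes for the declared encoding
        switch (body[0] & 0xFF) {
            case 0: cs = StandardCharsets.ISO_8859_1; width = 1; break;
            case 1: cs = StandardCharsets.UTF_16; width = 2; break;   // BOM decides byte order
            case 2: cs = StandardCharsets.UTF_16BE; width = 2; break; // ID3v2.4 only
            case 3: cs = StandardCharsets.UTF_8; width = 1; break;    // ID3v2.4 only
            default: throw new IllegalArgumentException("Unknown text encoding");
        }
        // The language code is always 3 ISO-8859-1 bytes, whatever the text encoding.
        String language = new String(body, 1, 3, StandardCharsets.ISO_8859_1);

        // The description runs up to the first terminator, scanned in whole code units.
        int start = 4;
        int term = start;
        while (term + width <= body.length && !isTerminator(body, term, width)) {
            term += width;
        }
        String description = new String(body, start, term - start, cs);

        // The text is the rest; drop trailing terminators in whole code units.
        int textStart = Math.min(term + width, body.length);
        int textEnd = body.length;
        while (textEnd - textStart >= width && isTerminator(body, textEnd - width, width)) {
            textEnd -= width;
        }
        String text = new String(body, textStart, textEnd - textStart, cs);
        return new CommFrameSketch(language, description, text);
    }
}
```

The point of the sketch is that the language is decoded with a fixed charset
and never with the one selected by the encoding byte, so a UTF-16 comment can
still carry a plain three-letter code such as "eng".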
> Invalid ASCII character (65533) when retriving MP3 metadata
> -----------------------------------------------------------
>
> Key: TIKA-793
> URL: https://issues.apache.org/jira/browse/TIKA-793
> Project: Tika
> Issue Type: Bug
> Components: metadata, parser
> Affects Versions: 1.0
> Environment: Ubuntu 10.04 (x64), Android (2.2 +)
> Reporter: William Seemann
> Priority: Minor
> Attachments: TikaTest.java
>
>
> When extracting metadata from certain mp3's (the id3 version appears to be
> 2.4) I'm seeing invalid characters at the end of the parsed fields. For
> example:
> American M�
> which should be:
> American Me