[ 
https://issues.apache.org/jira/browse/TIKA-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364890#comment-17364890
 ] 

Tim Allison edited comment on TIKA-3449 at 6/17/21, 4:41 PM:
-------------------------------------------------------------

Diffs in dates (created and modified): the legacy parser prefers the "track 
create date" and the "track modify date" over the "media create date" and the 
"media modified date". 

Where the audio sample rate differs, the newer parser agrees with exiftool, and 
I can't find the older parser's value anywhere in exiftool's output.  In 
short, I think the new parser fixes a bug in the legacy parser.

The legacy parser included length and width metadata for audio, with a value of 
zero.

I added a number of fields in the newer parser including subject, description, 
copyright and a few others.  So there's more metadata than in the legacy 
parser.  Still on the todo list is to parse the embedded album cover image 
file. :D

The only other difference I'm now seeing is that if the Apple user data box is 
truncated, the new parser extracts no information, whereas the legacy parser 
tried to extract as much as it could.  If we need to, we can write our own box 
iterator to try to scrape as much as we can, but I don't think most users will 
see this often.  Please open a ticket if there's a need for this.
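
A tolerant box iterator of that kind could look roughly like the Python sketch 
below. This is not Tika or metadata-extractor code: `iter_boxes` and `Box` are 
hypothetical names, and the sketch assumes only the standard ISO base media box 
header (32-bit big-endian size, four-character type, optional 64-bit 
largesize). When a box's declared size runs past the end of the data, it yields 
whatever payload is present, flags it as truncated, and stops instead of 
raising.

```python
import struct
from typing import Iterator, NamedTuple


class Box(NamedTuple):
    type: str        # four-character box type, e.g. "moov"
    payload: bytes   # the payload bytes that were actually available
    truncated: bool  # True if the declared size ran past the end of the data


def iter_boxes(data: bytes) -> Iterator[Box]:
    """Yield top-level boxes, salvaging a partial payload on truncation."""
    pos = 0
    while pos + 8 <= len(data):
        size, raw_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if pos + 16 > len(data):
                return  # the header itself is cut off; nothing to salvage
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # Size 0 means the box extends to the end of the data.
            size = len(data) - pos
        if size < header:
            return  # corrupt size field; stop rather than loop forever
        end = pos + size
        truncated = end > len(data)
        payload = data[pos + header:min(end, len(data))]
        yield Box(raw_type.decode("latin-1"), payload, truncated)
        if truncated:
            return
        pos = end
```

A caller could then keep every box with `truncated == False` as usual and 
decide case by case whether a partial payload is worth parsing.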

There were a few cases where the new parser was able to extract information 
that the legacy parser wasn't... I think it depends on where the file was 
truncated.

Most of the mp4s in our corpora are truncated.  In the new parser, I record the 
truncation under the existing parse-warn key in the metadata.  We're no longer 
throwing EOF if the mp4 is truncated, which is great because mp4s can stop 
short and this was an ongoing annoyance with the legacy parser. :D
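
The warn-instead-of-throw behavior can be sketched generically as follows. This 
is an illustration in Python, not the parser's actual code: `parse_lenient`, 
`read_fully` and the literal key name `"parse-warn"` are assumptions made for 
the example, and the metadata object is modeled as a plain dict.

```python
import io

PARSE_WARN_KEY = "parse-warn"  # hypothetical key name, for illustration only


def read_fully(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) < n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def parse_lenient(stream, chunk_sizes, metadata):
    """Read fixed-size chunks; on truncation, record a warning and keep
    whatever was read successfully instead of propagating the error."""
    chunks = []
    try:
        for n in chunk_sizes:
            chunks.append(read_fully(stream, n))
    except EOFError as e:
        metadata.setdefault(PARSE_WARN_KEY, []).append(f"truncated: {e}")
    return chunks


metadata = {}
chunks = parse_lenient(io.BytesIO(b"abcdefg"), [4, 4], metadata)
# chunks holds the first complete chunk; metadata carries the warning.
```

The point of the design is that the caller still gets the metadata extracted 
before the cut-off, and can check the warning key to see that the file was 
incomplete.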



was (Author: [email protected]):
diffs in date -- created and modified -- the legacy parser prefers the "track 
create date" and the "track modify date" to the "media create date" and the 
"media modified date". 

When audio sample rate differs: the newer parser agrees with exiftool, and I 
can't see in any of exiftool's output the value for the older parser.  In 
short, I think the new parser fixes a bug in the legacy parser.

Legacy parser included length and width metadata for audio, with value of zero.

I added a number of fields in the newer parser including subject, description, 
copyright and a few others.  So there's more metadata than in the legacy 
parser.  Still on the todo list is to parse the embedded album cover image 
file. :D

The only other difference I'm now seeing is that if the apple user data box is 
truncated, then the new parser is extracting no information, whereas the legacy 
parser tried to extract as much as it could.  If we need to, we can write our 
own box iterator to try to scrape as much as we can, but I don't think most 
users will see this often.  Please open a ticket if there's a need for this.

There were a few cases where the new parser was able to extract information 
that the legacy parser wasn't... I think it depends on where the file was 
truncated.

Most of the mp4s in our corpora are truncated.  In the new parser, I added a 
parse-warn key in the metadata, and that info is now being stored there.  We're 
no longer throwing EOF if the mp4 is truncated, which is great because mp4s can 
stop short and this was an ongoing annoyance with the legacy parser. :D


> Remove sannies mp4 isoparser from Tika 2.x
> ------------------------------------------
>
>                 Key: TIKA-3449
>                 URL: https://issues.apache.org/jira/browse/TIKA-3449
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> If we can prove equality or improvement in Drew Noakes' metadata-extractor's 
> MP4Parser over the no longer supported sannies' Mp4Parser, we should remove 
> sannies in 2.x.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
