[ https://issues.apache.org/jira/browse/TIKA-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1651:
------------------------------
Description:
Recently modified tika eval dev code that captures exceptions from embedded
documents reports ~30k exceptions in govdocs1 for what we currently identify
as xls files embedded in ppt and xls files.
It turns out that these are Microsoft Chart files/objects, which we
misidentify as xls. Let's add MIME detection for these embedded objects and
see whether we can use POI to parse the contents of any embedded tables.
was:
I haven't had a chance to look into this at all, but I wanted to open an issue
to track this. With recently modified tika eval dev code that captures
exceptions from embedded documents, there are ~30k exceptions in govdocs1 for
xls files embedded in ppt and xls files.
There's a chance that something went wrong with the eval code, and there's a
chance that these files are mis-typed, but we should take a look.
Example files to follow.
> Add mime (and parsing?) for Microsoft Chart object
> --------------------------------------------------
>
> Key: TIKA-1651
> URL: https://issues.apache.org/jira/browse/TIKA-1651
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Attachments: 11.xls, 428996.ppt, embedded_xls_stack_traces.csv
>
>
> Recently modified tika eval dev code that captures exceptions from embedded
> documents reports ~30k exceptions in govdocs1 for what we currently identify
> as xls files embedded in ppt and xls files.
> It turns out that these are Microsoft Chart files/objects, which we
> misidentify as xls. Let's add MIME detection for these embedded objects and
> see whether we can use POI to parse the contents of any embedded tables.