[ 
https://issues.apache.org/jira/browse/TIKA-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1651:
------------------------------
    Description: 
With recently modified tika eval dev code that captures exceptions from 
embedded documents, there are ~30k exceptions in govdocs1 for what we're 
currently identifying as xls files embedded in ppt and xls files. 

It turns out that these are Microsoft Chart objects, which we are currently 
misidentifying as xls.  Let's add mime detection for these embedded objects 
and see if we can use POI to parse the contents of any embedded tables.

  was:
I haven't had a chance to look into this at all, but I wanted to open an issue 
to track this.  With recently modified tika eval dev code that captures 
exceptions from embedded documents, there are ~30k exceptions in govdocs1 for 
xls files embedded in ppt and xls files. 

There's a chance that something went wrong with the eval code, and there's a 
chance that these files are mis-typed, but we should take a look.

Example files to follow.


> Add mime (and parsing?) for Microsoft Chart object
> --------------------------------------------------
>
>                 Key: TIKA-1651
>                 URL: https://issues.apache.org/jira/browse/TIKA-1651
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>         Attachments: 11.xls, 428996.ppt, embedded_xls_stack_traces.csv
>
>
> With recently modified tika eval dev code that captures exceptions from 
> embedded documents, there are ~30k exceptions in govdocs1 for what we're 
> currently identifying as xls files embedded in ppt and xls files. 
> It turns out that these are Microsoft Chart objects, which we are currently 
> misidentifying as xls.  Let's add mime detection for these embedded objects 
> and see if we can use POI to parse the contents of any embedded tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
