[ 
https://issues.apache.org/jira/browse/TIKA-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan LI updated TIKA-684:
-----------------------------

    Attachment: 2eebe3db1196aa8ea58c9be83965f0eb.ppt

Source file is from the Enron Sample Data Set:
http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2

License: Creative Commons Attribution 3.0 United States License.

> Partial/Incomplete text extraction for certain Powerpoint files
> ---------------------------------------------------------------
>
>                 Key: TIKA-684
>                 URL: https://issues.apache.org/jira/browse/TIKA-684
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Jonathan LI
>         Attachments: 2eebe3db1196aa8ea58c9be83965f0eb.ppt
>
>
> Example file with issue attached.
> Tika throws an exception during text extraction of certain PowerPoint files.
> In this example file, the extracted text only goes up to slide 37. Text from
> slides 38-40 is missing.
> Tested via both the Tika library and the Tika GUI. Apache POI (3.8 beta 3 and
> 3.7) extracts the text of this file without any issues.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
