[
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099475#comment-17099475
]
Tim Allison edited comment on TIKA-3094 at 5/5/20, 1:27 AM:
------------------------------------------------------------
Thank you [~bob]!
For kicks, I ran the osgi'd Tika against all of our test files and found a few
more issues. I left in the Ignored unit test so that you can see what I'm
saying.
1) jdom2 is needed by the rss parser (I fixed this in master)
2) java.lang.ClassNotFoundException: javax.xml.bind.JAXBException not found by
org.apache.tika.bundle [19] ...can't figure out how to fix this
3) We're left with several exceptions caused by adding the wrong type of
metadata, and we aren't seeing those with regular Tika. I can't figure out why
we're getting these in OSGi but not in regular Tika.
On 2), I tried a bunch of variants of the package that should bring that in,
but had no luck. [~bob], sorry, again, can you take a look?
On 3), I'll look more closely tomorrow to try to figure out what's going on.
> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24, 1.24.1
> Reporter: Abhishek Chauhan
> Assignee: Bob Paulin
> Priority: Critical
> Attachments: Sample PPT.pptx
>
>
> This is a regression from version 1.23 of Apache Tika. Text extraction for
> .pptx files, which worked with Apache Tika 1.23, no longer works in
> version 1.24.
> For .ppt files it works fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html], you
> have updated POI to 4.1.2. That might be the root cause of this problem.
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2],
> which I guess is not present in the bundle.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)