[
https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417703#comment-16417703
]
Tim Allison edited comment on TIKA-2569 at 3/30/18 4:14 PM:
------------------------------------------------------------
Whoa! This added a huge amount of newly extracted text in our regression
corpus. In a preliminary analysis (YMMV), we went from 56,400,782 to
57,423,955 common tokens. That's roughly a 1.8% increase in extracted content.
Thank you, [~BAEApache]!
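
The relative increase implied by the two token counts above can be checked
with a quick calculation. This is only a minimal sketch: the variable names
are illustrative, and it is not part of Tika's regression tooling.

```python
# Common-token counts quoted in the comment above.
before = 56_400_782
after = 57_423_955

# Absolute and relative growth in extracted tokens.
added = after - before
pct_increase = 100 * added / before

print(f"added tokens: {added:,}")
print(f"relative increase: {pct_increase:.2f}%")
```
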
was (Author: [email protected]):
Whoa! This added a huge amount of newly extracted text in our regression
corpus. Thank you, [~BAEApache]!
> Grouped Text boxes in .ppt
> --------------------------
>
> Key: TIKA-2569
> URL: https://issues.apache.org/jira/browse/TIKA-2569
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Richard A
> Assignee: Tim Allison
> Priority: Major
> Labels: easyfix
> Fix For: 1.18, 2.0.0
>
> Attachments: Presentation1.ppt, Presentation1.pptx
>
>
> Grouped text boxes cannot be parsed: no content is returned when items
> have been grouped together. This issue does not seem to affect .pptx
> files, only .ppt. The attached documents are identical except for the file
> format. They should provide a very simple example of a .ppt document from
> which no content is returned.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)