[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271201#comment-14271201 ]
Tim Allison commented on TIKA-1445:
-----------------------------------
No major problems found via quick and dirty govdocs1 eval.
Let's roll!
Better:
Fewer pdf exceptions, better pdf text extraction (thank you, [~tilman]!)
"fixed exceptions": 2426 xls, 895 ppt, 158 pdf, 17 pps and 5 doc
Note: "fixed exceptions" for xls are driven entirely by [~gagravarr]'s addition
of parsing for xls .4. Thank you, Nick!!!
More attachments for 27 pdf and 1 doc
More metadata values for all comparable file pairs (pairs with no exceptions
and an equal number of attachments)
Areas for investigation:
"new exceptions": 27 xls
173 exceptions for newly added parsing of vnd.ms.excel.sheet.3
Fewer attachments for 19 ppt, 6 doc and 1 rtf
Permanent hangs/OOMs: the counts vary from run to run because of
multi-threading, but we went from 4 to 3.
I'll follow up with investigation of these issues and open appropriate tickets
next week.
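
For anyone unfamiliar with the terms: as used above, a "fixed exception" can be
read as a file that threw an exception in the earlier run but not in the newer
one, and a "new exception" as the reverse. The following is only a minimal
sketch of that tally, assuming each run is summarised as a mapping from file
name to whether parsing threw. The function name tally_exception_changes and
the sample data are hypothetical; this is not the code used for the govdocs1
eval.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_exception_changes(before, after):
    """Count fixed and new exceptions per file extension.

    `before` and `after` map a file name to True if parsing that file
    threw an exception in that run. Only files present in both runs
    are compared.
    """
    fixed = Counter()
    new = Counter()
    for name in before.keys() & after.keys():
        # Group by lowercased extension; files without one go under "(none)".
        ext = PurePosixPath(name).suffix.lstrip(".").lower() or "(none)"
        if before[name] and not after[name]:
            fixed[ext] += 1
        elif after[name] and not before[name]:
            new[ext] += 1
    return fixed, new


# Made-up data, purely for illustration.
before = {"a.xls": True, "b.xls": False, "c.pdf": True, "d.ppt": True}
after = {"a.xls": False, "b.xls": True, "c.pdf": False, "d.ppt": True}
fixed, new = tally_exception_changes(before, after)
print(dict(fixed), dict(new))
```

Files that throw in both runs (d.ppt here) fall into neither bucket, which is
why the two counts need not add up to the total number of exceptions.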
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Blocker
> Fix For: 1.7
>
> Attachments: 000003.doc, TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch,
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch,
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.