Tim Allison created TIKA-2846:
---------------------------------
Summary: Add per page unicode mapping stats to the metadata in the
PDFParser
Key: TIKA-2846
URL: https://issues.apache.org/jira/browse/TIKA-2846
Project: Tika
Issue Type: Task
Reporter: Tim Allison
As part of TIKA-2749, it would be useful to gather stats on characters that do
not have a Unicode mapping. Users could use this information now to determine
which pages might benefit from OCR.
I propose an array of floats/doubles, with one entry per page. Each entry
would be the ratio (# of characters without a Unicode mapping / total # of
characters) for that page.
We might also want an int array holding the total number of characters per page.
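
To make the proposed statistic concrete, here is a minimal sketch of the per-page calculation. It is only an illustration: the class and method names (UnmappedCharStats, unmappedRatios) are hypothetical and not part of Tika, and it assumes the per-page counts have already been collected during parsing. Reporting 0.0 for a page with no characters is also an assumption, since the behavior for empty pages is not specified above.

```java
// Hypothetical helper, not part of Tika: names and signatures are illustrative only.
final class UnmappedCharStats {

    private UnmappedCharStats() {}

    /**
     * Returns one ratio per page: unmappedPerPage[i] / totalPerPage[i].
     * A page with zero characters gets 0.0 (an assumption; the issue does not say).
     */
    static double[] unmappedRatios(int[] unmappedPerPage, int[] totalPerPage) {
        if (unmappedPerPage.length != totalPerPage.length) {
            throw new IllegalArgumentException("arrays must have one entry per page");
        }
        double[] ratios = new double[totalPerPage.length];
        for (int i = 0; i < ratios.length; i++) {
            int total = totalPerPage[i];
            // Cast before dividing so the result is not truncated by integer division.
            ratios[i] = total == 0 ? 0.0 : (double) unmappedPerPage[i] / total;
        }
        return ratios;
    }
}
```

Under these assumptions, a three-page document with 0 of 100, 5 of 10, and 0 of 0 unmapped characters would yield one ratio per page, and a caller could flag pages whose ratio exceeds some threshold as OCR candidates.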