Matt Sheppard created TIKA-911:
----------------------------------
Summary: Converted PDF document contains question marks in place
of spaces and inconsistent case
Key: TIKA-911
URL: https://issues.apache.org/jira/browse/TIKA-911
Project: Tika
Issue Type: Bug
Affects Versions: 1.1
Reporter: Matt Sheppard
The PDF document at
http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf,
when converted with Tika 1.1 using
{code}
$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
{code}
produces substantially worse output than xpdf's pdftotext program.
Specifically, some spaces are replaced with question marks:
{noformat}
...
<body><div class="page"><p/>
<p>How can I help?
When you're overseas:
• ?wherever?possible,?don't?visit?crops?—?contact?with?
</p>
<p>growing?crops?greatly?increases?the?risk?of?contaminating?
footwear?or?clothing;?
...
{noformat}
and there are some odd case conversions:
{noformat}
<p>stem rust in wheat.
(soURce: BRAd collIs)</p>
<p/>
</div>
{noformat}
(The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
To compare that with pdftotext:
{code}
$ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
{code}
This does not output the question marks, and it produces "Source: BRAD COLLIS"
at the end, both of which seem to be improvements. Note that it does,
however, produce a number of ^G characters, which are not desirable.
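
For anyone comparing the two outputs, the ^G characters (BEL, byte 0x07) can be filtered out of the pdftotext result before diffing. The following is only a minimal sketch of one way to do that: the helper name strip_bel is made up for illustration, and it assumes a POSIX tr that accepts octal escapes.

```shell
# Hypothetical helper (not part of xpdf or Tika): read text on stdin and
# write it to stdout with every BEL byte (0x07, displayed as ^G) removed.
strip_bel() {
  tr -d '\007'
}
```

Assuming pdftotext is on the PATH and is given "-" as the output file so that it writes to stdout, this could be used as `pdftotext -enc UTF-8 -q Rust\ Biosecurity\ Brochure.pdf - | strip_bel`. It only hides the stray characters; it says nothing about why pdftotext emits them.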