Tika 0.8 line break removal is faulty (misses space when concatenating lines)
for PDF file
------------------------------------------------------------------------------------------
Key: TIKA-583
URL: https://issues.apache.org/jira/browse/TIKA-583
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Reporter: Dennis Adler
When Tika 0.7 parses the included PDF (a legal filing from the web), the first
several lines of plain text are:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
SERGEY SAVCHUK, )
) No. 64269-3-I
Appellant, )
v. )
) UNPUBLISHED OPINION
STEVEN G. JERDE and )
DARLYCE J. JERDE, husband and wife )
)
Respondents. )
_______________________________ ) FILED: November 1, 2010
--------------- end ---------------------
Tika 0.8 produces this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION
ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG.
JERDE and )DARLYCE J. JERDE, husband and
wife))Respondents.)_______________________________ )FILED: November 1,
2010schindler, j
--------------- end ---------------------
Notice that, as part of the improved paragraph breaking for PDF files, the
lines of the document "header" were concatenated without spaces in between,
creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK").
Compare the original PDF with the extracted text for more details.
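
For illustration, here is a minimal sketch of the joining behaviour one would
expect when line breaks are removed: a single space is inserted wherever two
lines would otherwise run together. The class name LineJoiner and the method
joinLines are hypothetical and are not taken from Tika or PDFBox; this only
shows the expected result, not how either library actually implements
paragraph detection.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
class LineJoiner {

    // Joins lines into one paragraph, inserting a single space between two
    // lines unless whitespace is already present at the join point.
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // nothing to append for an empty line
            }
            boolean needsSpace = out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))
                    && !Character.isWhitespace(line.charAt(0));
            if (needsSpace) {
                out.append(' ');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First lines of the Tika 0.7 output quoted in this report.
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "SERGEY SAVCHUK, )");
        System.out.println(joinLines(header));
    }
}
```

With this rule the first two header lines join as "WASHINGTON DIVISION ONE"
rather than "WASHINGTONDIVISION ONE".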