[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981822#action_12981822 ]
Ken Krugler commented on TIKA-583:
----------------------------------
Is this a PDFBox issue or a Tika issue? Any chance you could re-run it with
Tika 0.8, but using the PDFBox jar from Tika 0.7?
> Tika 0.8 line break removal is faulty (misses space when concatenating lines)
> for PDF file
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-583
> URL: https://issues.apache.org/jira/browse/TIKA-583
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
> Reporter: Dennis Adler
> Attachments: Savchuk v. Jerde.pdf
>
>
> The attached PDF (a legal filing from the web), when parsed by Tika 0.7, has
> the following as its first several lines of plain text:
> ------- start ---------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
> DIVISION ONE
> SERGEY SAVCHUK, )
> ) No. 64269-3-I
> Appellant, )
> v. )
> ) UNPUBLISHED OPINION
> STEVEN G. JERDE and )
> DARLYCE J. JERDE, husband and wife )
> )
> Respondents. )
> _______________________________ ) FILED: November 1, 2010
> --------------- end ---------------------
> Tika 0.8 has this instead:
> -------------- start ---------------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION
> ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG.
> JERDE and )DARLYCE J. JERDE, husband and
> wife))Respondents.)_______________________________ )FILED: November 1,
> 2010schindler, j
> --------------- end ---------------------
> Notice that, as part of the improved paragraph breaking for PDF files, the
> lines of the document "header" were concatenated without spaces in
> between, creating run-on words (e.g. "WASHINGTONDIVISION" and
> "ONESERGEYSAVCHUK"). Compare the original PDF with the extracted text for
> more details.
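
When comparing the two runs (for example Tika 0.8 with its own PDFBox jar versus the PDFBox jar from Tika 0.7), it may help to flag the run-on words mechanically instead of reading the output by eye. The following is only a rough sketch, not part of Tika or PDFBox: the class `RunOnWords` and its methods are hypothetical names, and it assumes both extractions are already available as plain strings. It reports every token in the new output that is absent from the old output but equals two or more consecutive old tokens joined without a space.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Tika or PDFBox API.
final class RunOnWords {

    /** Splits text on whitespace, dropping empty tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if token equals two or more consecutive reference tokens joined with no separator. */
    static boolean isJoinOfNeighbours(String token, List<String> ref) {
        for (int start = 0; start < ref.size(); start++) {
            StringBuilder joined = new StringBuilder();
            // Stop extending once the joined text is at least as long as the token.
            for (int end = start; end < ref.size() && joined.length() < token.length(); end++) {
                joined.append(ref.get(end));
                if (end > start && joined.toString().equals(token)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns candidate tokens that look like reference tokens glued together. */
    static List<String> findRunOns(String reference, String candidate) {
        List<String> ref = tokens(reference);
        Set<String> known = new HashSet<String>(ref);
        List<String> runOns = new ArrayList<String>();
        for (String token : tokens(candidate)) {
            if (!known.contains(token) && isJoinOfNeighbours(token, ref)) {
                runOns.add(token);
            }
        }
        return runOns;
    }

    public static void main(String[] args) {
        // Shortened excerpts of the two outputs quoted in the issue.
        String tika07 = "STATE OF WASHINGTON\nDIVISION ONE\nSERGEY SAVCHUK, )";
        String tika08 = "STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,)";
        System.out.println(findRunOns(tika07, tika08));
    }
}
```

A non-empty result for the stock Tika 0.8 output and an empty one after swapping in the older PDFBox jar would point towards PDFBox; run-ons in both cases would point towards Tika itself. The check is deliberately naive (quadratic in the number of tokens), which should be fine for the first page of a document.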