[ https://issues.apache.org/jira/browse/PDFBOX-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030810#comment-17030810 ]
ASF subversion and git services commented on PDFBOX-4760:
---------------------------------------------------------
Commit 1873657 from [email protected] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1873657 ]
PDFBOX-4760: don't check for word separator at the end of a word if separator
is empty
> wordSeparator not being inserted when word ends with " "
> --------------------------------------------------------
>
> Key: PDFBOX-4760
> URL: https://issues.apache.org/jira/browse/PDFBOX-4760
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18
> Reporter: John Gesimondo
> Priority: Major
> Fix For: 2.0.19, 3.0.0 PDFBox
>
>
> If you set the wordSeparator to something other than a space (for instance
> "\t") and a word happens to end with " ", the designated wordSeparator is
> not added.
> That is because the line of code that adds the word separator hard-codes the
> rule that the separator is skipped when the word ends with " ". This is
> incorrect: it assumes " " is the word separator, but the separator is a
> configurable option.
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L639]
> fix: change {{.endsWith(" ")}} to {{.endsWith(wordSeparator)}}
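
The two checks discussed above can be compared in a small standalone sketch. This is not the PDFBox source: the class WordSeparatorSketch and its methods are hypothetical names that only model the decision "should a word separator be added after this text?". The empty-separator guard follows the commit message, on the assumption that bypassing the check means the separator item is still emitted.

```java
// Sketch only: models the separator decision, not PDFBox's actual code.
final class WordSeparatorSketch {

    // Behaviour described in the report: the test is hard-coded to " ",
    // so the configured separator is ignored (parameter kept to show that).
    static boolean hardCodedCheck(String lastText, String wordSeparator) {
        return !lastText.endsWith(" ");
    }

    // Proposed fix (compare against the configured separator) combined with
    // the guard named in the commit message (skip the test if it is empty).
    static boolean configurableCheck(String lastText, String wordSeparator) {
        // In Java, s.endsWith("") is always true, so without the isEmpty()
        // guard an empty separator would suppress the separator every time.
        return wordSeparator.isEmpty() || !lastText.endsWith(wordSeparator);
    }

    public static void main(String[] args) {
        String lastText = "word ";   // text that happens to end with a space
        String separator = "\t";     // configured, non-space separator
        System.out.println("hard-coded check adds separator:   "
                + hardCodedCheck(lastText, separator));
        System.out.println("configurable check adds separator: "
                + configurableCheck(lastText, separator));
    }
}
```

With a tab separator and text ending in a space, the hard-coded check declines to add a separator while the configurable check adds one, which is the difference the report describes.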