[ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mel Martinez updated PDFBOX-659:
--------------------------------
Attachment: patch_pdfbox_659.txt
The attached patch fixes the problem of incorrectly inserted newlines.
The problem was (as described above) caused by TextPosition coordinates falling
in negative space while the code incorrectly used '-1.0' as its reset
comparison value.
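To illustrate this class of bug, here is a minimal sketch. It is a simplified
model and not the actual PDFTextStripper code or the attached patch; the class
name, the render() method and its parameters are all hypothetical. It assumes a
line tracker that is reset to a fixed value after each line break and then
updated with the largest Y seen on the line.

```java
class ResetSentinelSketch {
    /**
     * Joins text fragments, starting a new line whenever a fragment's Y differs
     * from the tracked maximum Y of the current line by more than lineHeight.
     * resetValue is what the tracker is set to at the start of every line.
     */
    static String render(String[] texts, float[] ys, float lineHeight, float resetValue) {
        StringBuilder out = new StringBuilder();
        float maxYForLine = resetValue;
        for (int i = 0; i < texts.length; i++) {
            if (i > 0 && Math.abs(ys[i] - maxYForLine) > lineHeight) {
                out.append('\n');
                maxYForLine = resetValue; // reset the tracker for the new line
            }
            // With resetValue = -1.0 and negative Y values, max() keeps -1.0,
            // so the tracker never reflects the real position of the line.
            maxYForLine = Math.max(maxYForLine, ys[i]);
            out.append(texts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"Asp", "e", "ct"};
        float[] negativeYs = {-700f, -700f, -700f}; // all on the same line

        // Reset value -1.0: every fragment after the first lands on its own line.
        System.out.println(render(texts, negativeYs, 10f, -1.0f));

        // Reset value below any real coordinate: the fragments stay together.
        System.out.println(render(texts, negativeYs, 10f, Float.NEGATIVE_INFINITY));
    }
}
```

In this model a reset value of -1.0 only behaves as "no value yet" while all
coordinates are non-negative; a value that no real coordinate can exceed avoids
that assumption.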
This patch does not fix two additional problems that surface with the example
.pdf file: missing space characters (words are arbitrarily concatenated) and
missing characters.
The missing space characters can be recovered by setting the value of:
PDFTextStripper.setSpacingTolerance(float tolerance)
to a value smaller than the default (0.5). I had to drop it quite a bit with
this document and still did not recover all the spaces.
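As a rough illustration of why a smaller tolerance recovers more spaces, here
is a simplified model. It is not PDFBox's actual algorithm; it assumes a space
is emitted when the gap between two fragments exceeds the width of a space
multiplied by the tolerance. The class, method and parameter names are
hypothetical.

```java
class SpacingToleranceSketch {
    /**
     * Joins fragments left to right, inserting a space when the horizontal gap
     * to the previous fragment is larger than spaceWidth * tolerance.
     */
    static String join(String[] texts, float[] startX, float[] endX,
                       float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0) {
                float gap = startX[i] - endX[i - 1];
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(texts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"Service", "Adaptation"};
        float[] startX = {0f, 41f};
        float[] endX = {40f, 95f};   // gap between the two words is 1.0
        float spaceWidth = 3f;       // assumed width of a space glyph

        // Tolerance 0.5: threshold is 1.5, the gap of 1.0 is not seen as a space.
        System.out.println(join(texts, startX, endX, spaceWidth, 0.5f));

        // Tolerance 0.2: threshold is 0.6, the same gap now produces a space.
        System.out.println(join(texts, startX, endX, spaceWidth, 0.2f));
    }
}
```

Lowering the tolerance too far has the opposite risk in this model: ordinary
gaps between letters start to be treated as spaces.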
The missing characters are caused by the default mode of suppressing what the
code believes to be duplicate, overlapping characters. This can occur with MS
Word-generated PDFs. You can stop that behavior by setting the attribute:
PDFTextStripper.setSuppressDuplicateOverlappingText(boolean suppress);
to false.
That said, the logic used when this is set to 'true' looks flawed. I will open
a separate bug for that.
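To show how this kind of suppression can drop legitimate characters, here is a
simplified model. It is not the PDFTextStripper logic; it assumes a character
is treated as a duplicate when the same character was already kept within a
fixed horizontal tolerance. All names and the tolerance value are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateSuppressionSketch {
    /**
     * Concatenates characters in order. When suppress is true, a character is
     * dropped if an identical character was already kept within 'tolerance'
     * of its X position.
     */
    static String extract(String[] chars, float[] xs, float tolerance, boolean suppress) {
        StringBuilder out = new StringBuilder();
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < chars.length; i++) {
            boolean duplicate = false;
            if (suppress) {
                for (int j : kept) {
                    if (chars[j].equals(chars[i]) && Math.abs(xs[j] - xs[i]) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(i);
                out.append(chars[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Narrow glyphs: the two 'o' characters are genuinely distinct but close.
        String[] chars = {"b", "o", "o", "k"};
        float[] xs = {0f, 5f, 7f, 12f};

        System.out.println(extract(chars, xs, 3f, true));  // second 'o' is dropped
        System.out.println(extract(chars, xs, 3f, false)); // all characters kept
    }
}
```

Turning suppression off keeps every character, at the cost of seeing doubled
text where a generator really did draw the same glyph twice.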
> Newlines added in the middle of words
> -------------------------------------
>
> Key: PDFBOX-659
> URL: https://issues.apache.org/jira/browse/PDFBOX-659
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 0.8.0-incubator
> Reporter: Mario Sangiorgio
> Attachments: fulltext.pdf, page.png, patch_pdfbox_659.txt
>
>
> I am experiencing issues getting the text from a PDF document. The document I
> want to extract the text from is a scientific paper.
> The tool works fine, but in the title there are some problems: in the middle
> of some words I get a newline.
> An example of what I am getting is the following text:
> An
> Asp
> e
> ct-Orien
> ted
> F
> ramew
> o
> rk
> for
> S
> ervice
> A
> d
> aptation
> rather than the expected "An Aspect-Oriented Framework for Service
> Adaptation".
> Please let me know if I can help find the bug.