[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980809#action_12980809 ]
Mel Martinez commented on PDFBOX-588:
-------------------------------------
I can't confirm the performance of the most recent version.
But if it is slower, it is probably not because of the paragraph, page &
article demarcation code.
I tested that addition against v0.8 extensively, and it added essentially
no time.
It is possible the lower-level parsing is slower for some reason. There have
been a variety of other changes in PDFBox. In some cases what may seem like
only a marginal improvement in parsing quality (you indicate the 'results were
better') comes from a LOT more work being done in the parser to get things
correct.
Keep in mind that the text extraction is really a 'rendering' of what the
parsing has discovered. Both stages affect the total performance time, but the
parsing is generally the dominant phase.
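One way to check which stage dominates is to time the two phases separately. The following is only a minimal sketch, not PDFBox code: PhaseTimer and Timed are hypothetical helper names, and the two lambdas in main are stand-ins for the real stages. With PDFBox on the classpath they would be roughly PDDocument.load(file) for parsing and new PDFTextStripper().getText(document) for extraction, assuming a PDFBox version that still offers those calls.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs one phase and records its wall-clock duration. */
final class PhaseTimer {

    /** Holds a phase's return value together with its elapsed time in nanoseconds. */
    public static final class Timed<T> {
        public final T value;
        public final long nanos;

        Timed(T value, long nanos) {
            this.value = value;
            this.nanos = nanos;
        }

        /** Elapsed time converted to milliseconds. */
        public double millis() {
            return nanos / 1_000_000.0;
        }
    }

    /** Runs the given phase once and returns its result plus the elapsed time. */
    public static <T> Timed<T> time(Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T value = phase.call();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the two stages (assumption: replace these with the
        // real parse and text-extraction calls when measuring PDFBox).
        Timed<String[]> parsed = time(() -> "hello world".split(" "));
        Timed<String> extracted = time(() -> String.join("\n", parsed.value));

        System.out.printf("parse:   %.3f ms%n", parsed.millis());
        System.out.printf("extract: %.3f ms%n", extracted.millis());
    }
}
```

A single run like this is noisy because of JIT warm-up, so repeating each phase several times, or using a profiler, would give more trustworthy numbers.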
I don't have time right now, but I will definitely look into this further. I
have a good profiling tool. Performance of the parsing & text extraction is very
important to my work.
> Problem extracting text in newline characters
> ---------------------------------------------
>
> Key: PDFBOX-588
> URL: https://issues.apache.org/jira/browse/PDFBOX-588
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
> Environment: Win XP
> Reporter: Hesham
> Assignee: Andreas Lehmkühler
> Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt,
> PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png,
> PDFTextStripper.patch
>
>
> Hello,
>
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter characters between lines, so the last word in one line
> and the first word in the next line appear as one word with no space between
> them.
> However, if I copy the PDF text manually from the PDF and paste it into a text
> editor, Enter characters appear after the same lines that caused the problem
> in PDFBox.
> Please check the attached file as a sample.
>
> Is there a way to fix this?
>
> Best regards,