[ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965720#action_12965720 ]
Mel Martinez commented on PDFBOX-521:
-------------------------------------
Andreas,
I finally built everything. As usual, I had to hack around the fact that I
cannot use Maven in my current environment yet, but I got it done.
The patches you made work! Thank you so much!
That said, per my comment above, the behavior differs somewhat from my prior
subclass-based solution built on top of v1.0.
Because the newer code uses only one character per TextPosition, the max height
it sees is often smaller than the true max height for the line.
As a result, the default drop threshold (2.5) that we had before no longer
gives good paragraph detection. It inserts too many paragraph separations in
my test documents.
I got better results with a drop threshold of 3.0, and the output is 'mostly'
acceptable, but it still gets paragraph breaks wrong far more often than I'd
like.
I think a more accurate calculation would not use the max height of the
individual character position. Instead, the logic should check the max height
field of every text position in the line (i.e. between line separations) and
use the largest value in the line for the comparison with the ygap when
detecting whether a line separation is also a paragraph separation.
I'll look into coding that idea up to see if it works better.
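Here is a minimal, self-contained sketch of that idea, not the actual
PDFTextStripper code. The Glyph class is a hypothetical stand-in for
TextPosition, the method names are made up for illustration, and the
comparison "ygap > dropThreshold * height" is my assumption about how the
threshold is applied.

```java
import java.util.Arrays;
import java.util.List;

class ParagraphGapSketch {
    /** Hypothetical stand-in for a TextPosition; only the glyph height is modeled. */
    static final class Glyph {
        final float height;
        Glyph(float height) { this.height = height; }
    }

    /** Largest glyph height among all positions on one line. */
    static float maxHeightOfLine(List<Glyph> line) {
        float max = 0f;
        for (Glyph g : line) {
            max = Math.max(max, g.height);
        }
        return max;
    }

    /**
     * Assumed rule: a line separation is also a paragraph separation when the
     * vertical gap exceeds dropThreshold times the reference height.
     */
    static boolean isParagraphBreak(float referenceHeight, float yGap, float dropThreshold) {
        return yGap > dropThreshold * referenceHeight;
    }

    public static void main(String[] args) {
        // A line of mostly small glyphs plus one tall glyph.
        List<Glyph> line = Arrays.asList(new Glyph(5f), new Glyph(9f), new Glyph(5f));
        float yGap = 20f;
        float dropThreshold = 2.5f;

        // Reference height taken from the last single-character position only.
        float lastGlyphHeight = line.get(line.size() - 1).height;
        System.out.println(isParagraphBreak(lastGlyphHeight, yGap, dropThreshold));

        // Reference height taken as the largest value across the whole line.
        System.out.println(isParagraphBreak(maxHeightOfLine(line), yGap, dropThreshold));
    }
}
```

With these made-up numbers, the single-character height flags the gap as a
paragraph break while the per-line maximum does not, which is the kind of
spurious separation I am seeing.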
Another change is that various odd characters that used to get inserted into
the text as 'C11' or 'C23', etc., now get inserted as non-printables. Examples
include the Greek 'alpha' and 'nu' characters. I don't consider this a big deal
for our purposes, but it might be for others.
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.