[
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936068#action_12936068
]
Mel Martinez commented on PDFBOX-521:
-------------------------------------
One more thing ... sorry ...
In order to easily patch this problem (TextPosition.getHeight() is not properly
scaled to display units) in a PDFTextStripper subclass one can either
a) override 'processEncodedText(byte[])' - but this is a large complex method
and the nature of the problem requires basically copying the entire method into
the override (no deferring to 'super.processEncodedText()' here). Upon
copying, one runs into the fact that the method makes use of several private
member fields which would not (and should not) be visible to a subclass.
Every one of these fields DOES have a public accessor though. It is generally a
much better coding guideline to use these accessor methods to access member
fields even when inside the class. This override problem would not exist, and
indeed the code would be much better off, if it did so. Specifically, we should
replace use of
graphicsState -> with -> getGraphicsState()
textMatrix -> with -> getTextMatrix()
Further, there should be a protected or public access method for accessing the
'page' member, and the SPACE_BYTES constant should be made at least
protected, if not public.
The irony here is that the philosophy of design of the PDFStreamEngine class is
that one should be able to override the processXXX methods - the javadoc
comment specifically references that. We have drifted from that with the way
the code currently is.
b) alternatively, in order to fix the yScale problem in a subclass all one in
theory needs to do is override the isParagraphStart() method and multiply the
value of 'position.getTextPosition().getHeight()' by the y-scale
(position.getTextPosition().getYScale()).
Unfortunately, for some reason, when I defined the PositionWrapper class I made
the 'getTextPosition()' method protected instead of public. My bad. So in
order to make this work, I have to create yet another sub-class 'wrapper' of
PositionWrapper that opens up the access to the getTextPosition() method!
Geeze I feel dumb for that one ...
We should change the access of PositionWrapper.getTextPosition() to public.
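
To illustrate option (b), here is a minimal sketch of the wrapper workaround.
It is not the PDFBox code: TextPosition and PositionWrapper below are
simplified stand-ins that model only the accessors mentioned above, and
OpenPositionWrapper and getScaledHeight() are hypothetical names chosen for
this example.

```java
// Simplified stand-in for PDFBox's TextPosition (assumption: only the two
// getters named in this comment are modeled).
class TextPosition {
    private final float height;
    private final float yScale;

    TextPosition(float height, float yScale) {
        this.height = height;
        this.yScale = yScale;
    }

    public float getHeight() { return height; }   // height before scaling
    public float getYScale() { return yScale; }   // vertical scale factor
}

// Simplified stand-in for PositionWrapper, with the protected accessor
// described above.
class PositionWrapper {
    private final TextPosition position;

    PositionWrapper(TextPosition position) {
        this.position = position;
    }

    protected TextPosition getTextPosition() { return position; }
}

// Hypothetical workaround: a subclass that widens the accessor to public
// and exposes the height multiplied by the y-scale.
class OpenPositionWrapper extends PositionWrapper {
    OpenPositionWrapper(TextPosition position) {
        super(position);
    }

    @Override
    public TextPosition getTextPosition() {
        return super.getTextPosition();
    }

    public float getScaledHeight() {
        TextPosition tp = getTextPosition();
        return tp.getHeight() * tp.getYScale();
    }
}
```

An overridden isParagraphStart() could then compare getScaledHeight() values
instead of the raw heights. Making PositionWrapper.getTextPosition() public
would remove the need for the extra subclass entirely.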
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.