[
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935583#action_12935583
]
Mel Martinez commented on PDFBOX-521:
-------------------------------------
One other issue has popped up with the merge.
With my original code, each TextPosition object contained a string of
characters; now each TextPosition object holds only a single character.
The code still properly detects line separation, because the endX & endY
positions still work out correctly.
However the code no longer properly detects paragraph starts.
The reason is that my code (in the PDFTextSplitter.isParagraphStart()
method) used the TextPosition.getHeight() method to get the maximum height
of the current string and then compared that height against the vertical
drop measured after the line separation was detected.
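To illustrate the kind of comparison described here, a minimal sketch follows. It is not the actual PDFTextSplitter.isParagraphStart() code: the class name, the parameters, and the DROP_FACTOR threshold are all assumptions made up for the example.

```java
// Hypothetical sketch of a drop-versus-height paragraph check.
// Not the real PDFTextSplitter code; names and threshold are assumed.
class ParagraphStartSketch {

    // Assumed threshold: a vertical drop larger than this multiple of the
    // line height is treated as a paragraph gap.
    static final float DROP_FACTOR = 1.5f;

    // previousLineY / currentLineY: baseline positions of two consecutive lines.
    // maxLineHeight: the largest glyph height seen on the previous line.
    static boolean isParagraphStart(float previousLineY, float currentLineY,
                                    float maxLineHeight) {
        float drop = Math.abs(currentLineY - previousLineY);
        return drop > DROP_FACTOR * maxLineHeight;
    }

    public static void main(String[] args) {
        // Same drop, two different reported heights.
        System.out.println(isParagraphStart(700f, 684f, 14.8f));
        System.out.println(isParagraphStart(700f, 684f, 0.281f));
    }
}
```

In a check of this shape, the answer depends directly on the units of the height value, so a height reported on a different scale changes the result for the same drop.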
However, TextPosition.getHeightDir() now (1.3.1) returns a very different
value than it did before, so the math is way off. Running the same test
document through and stopping at the exact same processing point in the
document, I get 14.83628 in the PDFBox 1.0.0-based code and 0.281 in the
PDFBox 1.3.1-based code. That is a factor of, um ... about 52.8!
I suspect it has to do with the changes in the TextPosition object. In
addition to the change from containing a string to containing a single
character, I note that the 'font' member object is different. Previously, in
my old code based on PDFBox 1.0.0, in my test document it shows in the debugger
as a 'PDType1Font'. Now, in PDFBox 1.3.1 the TextPosition.font member points
to a 'PDType1CFont' object.
Doing a bit of sleuthing, I traced things back and see that getHeightDir()
is an accessor for the 'maxTextHeight' field (10 fail points for not using
matching names!), which is set via the TextPosition constructor's
'maxFontH' parameter.
This value is calculated in the PDFStreamEngine.processEncodedText(byte[])
method with the line:
    float totalVerticalDisplacementDisp = maxVerticalDisplacementText * fontSizeText;
Now, looking back in the prior (1.0.0) code, this used to be calculated instead
like so:
    float totalVerticalDisplacementDisp = maxVerticalDisplacementText * fontSizeText * yScaleDisp;
I don't know yet whether that was intentional or not. There are a lot of
changes to the processEncodedText(byte[]) method and I haven't had a chance to
fully figure them out yet.
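To make the difference between the two calculations concrete, here is a small standalone sketch. The variable names are taken from the lines quoted above; the class, the method names, and the sample numbers are assumptions for illustration and do not come from PDFBox or from the test document.

```java
// Standalone comparison of the two formulas quoted above.
// Class, method names, and sample values are hypothetical.
class VerticalDisplacementSketch {

    // Shape of the 1.3.1 calculation: no display scaling applied.
    static float heightTextOnly(float maxVerticalDisplacementText,
                                float fontSizeText) {
        return maxVerticalDisplacementText * fontSizeText;
    }

    // Shape of the 1.0.0 calculation: scaled by yScaleDisp as well.
    static float heightWithDisplayScale(float maxVerticalDisplacementText,
                                        float fontSizeText,
                                        float yScaleDisp) {
        return maxVerticalDisplacementText * fontSizeText * yScaleDisp;
    }

    public static void main(String[] args) {
        // Made-up inputs, chosen only to show the effect of the extra factor.
        float maxVerticalDisplacementText = 0.7f;
        float fontSizeText = 1.0f;
        float yScaleDisp = 12.0f;

        float textOnly = heightTextOnly(maxVerticalDisplacementText, fontSizeText);
        float scaled = heightWithDisplayScale(maxVerticalDisplacementText,
                                              fontSizeText, yScaleDisp);
        System.out.println("without yScaleDisp: " + textOnly);
        System.out.println("with yScaleDisp:    " + scaled);
        System.out.println("ratio:              " + (scaled / textOnly));
    }
}
```

The two results differ by exactly the yScaleDisp factor, so whenever that factor is far from 1, the two versions report heights on very different scales for the same glyph.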
If this change is unintended, then it is a bug that needs to be fixed in
order for the paragraph detection to work. An unintended change seems likely,
since I would expect the TextPosition.getHeight() and
TextPosition.getHeightDir() methods to return values in display units, just
like the getWidthXX() methods.
If this IS intentional, I can apply the display scaling after the fact using
the TextPosition.getYScale() method, or use an alternative method for judging
the relative size of the drop gap.
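The first workaround could look roughly like the sketch below. To keep it self-contained it uses a hypothetical HeightSource interface standing in for the two TextPosition accessors named above. Whether getYScale() really corresponds to the yScaleDisp factor that was dropped from processEncodedText(byte[]) is an assumption that would need to be checked against the PDFBox source.

```java
// Hypothetical workaround sketch: re-apply the vertical display scale
// in the caller. HeightSource is a stand-in, not a PDFBox type.
class RescaleSketch {

    // Mirrors the two accessors mentioned above (assumed semantics).
    interface HeightSource {
        float getHeightDir(); // height as reported by the 1.3.1-based code
        float getYScale();    // vertical display scale for the same text
    }

    // Returns the height multiplied by the vertical display scale.
    static float displayHeight(HeightSource position) {
        return position.getHeightDir() * position.getYScale();
    }

    public static void main(String[] args) {
        // Made-up values for demonstration only.
        HeightSource sample = new HeightSource() {
            public float getHeightDir() { return 0.5f; }
            public float getYScale() { return 4f; }
        };
        System.out.println(displayHeight(sample));
    }
}
```

Keeping the rescaling in one helper means only that helper has to change if the library later goes back to returning display units.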
I just need to know.
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.