[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Wed, 24 Nov 2010 16:05:43 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935583#action_12935583
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

One other issue has popped up with the merge.

With my original code, each TextPosition object contained a string of 
characters. Now each TextPosition object holds a single character.

The code still properly detects line separation, because the endX & endY 
positions still work out correctly.

However the code no longer properly detects paragraph starts.

The reason is that my code (in the PDFTextSplitter.isParagraphStart() 
method) was using the TextPosition.getHeight() method to get the maximum height 
of the current string and then comparing it against the vertical drop found 
after the line separation detection.

However, TextPosition.getHeightDir() now (1.3.1) returns a very different 
value than it did before, so the math is waaaayyy off.   Running the same test 
document through to the exact same processing point, I get that value as 
14.83628 in the PDFBox 1.0.0 based code and as 0.281 in the PDFBox 1.3.1 based 
code.   That is a factor of, um ... 52.8!!!!!
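
To see why a height that far off breaks the drop comparison, here is a rough 
sketch. It is not the actual isParagraphStart() logic: the class name, the 
1.5 multiplier and the 16.0 line drop are hypothetical values picked for 
illustration. Only the two heights come from the test run described above.

```java
final class DropCheck {
    // Hypothetical threshold: a drop larger than this many line heights
    // is treated as a paragraph gap.
    static final float GAP_MULTIPLIER = 1.5f;

    // True when the vertical drop is large relative to the text height.
    static boolean looksLikeParagraphGap(float yDrop, float maxHeight) {
        return yDrop > maxHeight * GAP_MULTIPLIER;
    }

    public static void main(String[] args) {
        float yDrop = 16.0f; // hypothetical drop between two ordinary lines
        // Height as reported by the 1.0.0 based code.
        System.out.println(looksLikeParagraphGap(yDrop, 14.83628f));
        // Height as reported by the 1.3.1 based code.
        System.out.println(looksLikeParagraphGap(yDrop, 0.281f));
    }
}
```

With the 1.0.0 height an ordinary line drop stays under the threshold. With 
the 1.3.1 height the same drop is far above it, so any comparison of this 
shape stops being meaningful.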

I suspect it has to do with the changes in the TextPosition object.  In 
addition to the change from containing a string to containing a single 
character, I note that the 'font' member object is different.  Previously, in 
my old code based on PDFBox 1.0.0, in my test document it shows in the debugger 
as a 'PDType1Font'.   Now, in PDFBox 1.3.1  the TextPosition.font member points 
to a 'PDType1CFont' object.

Doing a bit of sleuthing, I traced things back a bit and I see that the 
getHeightDir() is an accessor for the 'maxTextHeight' field (10 fail points for 
not using matching names!) which is set via the TextPosition constructor 
'maxFontH' parameter.  

This value is calculated in the PDFStreamEngine.processEncodedText(byte[]) 
method with the line:

            float totalVerticalDisplacementDisp =
                maxVerticalDisplacementText * fontSizeText;

Now, looking back in the prior (1.0.0) code, this used to be calculated instead 
like so:

            float totalVerticalDisplacementDisp =
                maxVerticalDisplacementText * fontSizeText * yScaleDisp;

I don't know yet whether that was intentional or not.  There are a lot of 
changes to the processEncodedText(byte[]) method and I haven't had a chance to 
fully figure them out yet.
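
For comparison, the two calculations can be put side by side in a small 
standalone sketch. The class and method names are made up for illustration, 
and the input values in main are hypothetical, not taken from the test 
document. Note that the font class also changed between versions, so the 
dropped factor may not account for the whole difference.

```java
final class VerticalDisplacement {
    // 1.3.1 style: no display scale applied.
    static float withoutScale(float maxVerticalDisplacementText,
                              float fontSizeText) {
        return maxVerticalDisplacementText * fontSizeText;
    }

    // 1.0.0 style: multiplied by the vertical display scale.
    static float withScale(float maxVerticalDisplacementText,
                           float fontSizeText,
                           float yScaleDisp) {
        return maxVerticalDisplacementText * fontSizeText * yScaleDisp;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, for illustration only.
        float maxVd = 0.7f;
        float fontSize = 1.0f;
        float yScale = 12.0f;
        float unscaled = withoutScale(maxVd, fontSize);
        float scaled = withScale(maxVd, fontSize, yScale);
        // With the other inputs equal, the two results differ by yScaleDisp.
        System.out.println(unscaled + " vs " + scaled);
    }
}
```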

If this change is unintended, then this is a bug that needs to be fixed in 
order for the paragraph detection to work.  This seems likely since I would 
think that the TextPosition.getHeight() and TextPosition.getHeightDir() methods 
are intended to return in display units, just like the getWidthXX() methods.

If this IS intentional, I can apply the display scaling after the fact using 
the TextPosition.getYScale() method, or use an alternative method for 
calculating the relative meaning of the drop gap.
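
The first option would look roughly like this. To keep the sketch 
self-contained it uses a stand-in interface (HeightSource, a made-up name) 
exposing only the two TextPosition accessors mentioned above, rather than the 
real PDFBox class. It also assumes that getYScale() is the same factor that 
was dropped from the calculation, which I have not confirmed.

```java
final class DisplayHeight {
    // Stand-in for the two TextPosition accessors; not the PDFBox class.
    interface HeightSource {
        float getHeightDir();
        float getYScale();
    }

    // Simple fixed-value implementation for trying the sketch out.
    static final class FixedSource implements HeightSource {
        private final float heightDir;
        private final float yScale;

        FixedSource(float heightDir, float yScale) {
            this.heightDir = heightDir;
            this.yScale = yScale;
        }

        public float getHeightDir() { return heightDir; }
        public float getYScale() { return yScale; }
    }

    // Applies the vertical display scale to the unscaled height.
    static float displayHeight(HeightSource text) {
        return text.getHeightDir() * text.getYScale();
    }
}
```

The paragraph-start check would then compare the vertical drop against 
displayHeight(...) instead of the raw getHeightDir() value.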

I just need to know. 


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
