[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Fri, 26 Nov 2010 10:03:41 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936068#action_12936068
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

One more thing ... sorry ... 

In order to easily patch this problem (TextPosition.getHeight() is not properly 
scaled to display units) in a PDFTextStripper subclass one can either

a) override 'processEncodedText(byte[])' - but this is a large, complex method 
and the nature of the problem basically requires copying the entire method into 
the override (no deferring to 'super.processEncodedText()' here).  Upon 
copying, one runs into the fact that the method makes use of several private 
member fields which would not (and should not) be visible to a subclass.  
Every one of these fields DOES have a public accessor though.  It is generally a 
much better coding guideline to use these accessor methods to access member 
fields even from inside the class.  This override problem would not exist, and 
indeed the code would be much better off, if it did so.  Specifically, we should 
replace use of

graphicsState   -> with -> getGraphicsState()
textMatrix   -> with -> getTextMatrix()

Further, there should be a protected or public accessor method for the 
'page' member, and the SPACE_BYTES constant should be made at least 
protected, if not public.

The irony here is that the philosophy of design of the PDFStreamEngine class is 
that one should be able to override the processXXX methods - the javadoc 
comment specifically references that.  We have drifted from that with the way 
the code currently is.
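
To illustrate the accessor point, here is a minimal sketch.  StubEngine and 
StubStripper are hypothetical stand-ins, not the real PDFStreamEngine or 
PDFTextStripper, and the String fields are placeholders for the real state 
objects.  Only the accessor names getGraphicsState() and getTextMatrix() are 
taken from the comment above.

```java
// Hypothetical stand-in for a stream engine; NOT the real PDFBox class.
class StubEngine {
    // Private state, deliberately invisible to subclasses.
    private final String graphicsState = "gs";
    private final String textMatrix = "tm";

    public String getGraphicsState() { return graphicsState; }
    public String getTextMatrix() { return textMatrix; }

    // The body goes through the accessors instead of the private fields,
    // so it can be copied into a subclass override and still compile.
    public String processEncodedText(byte[] data) {
        return getGraphicsState() + "/" + getTextMatrix() + "/" + data.length;
    }
}

// Hypothetical subclass that replaces the whole method without calling super.
class StubStripper extends StubEngine {
    @Override
    public String processEncodedText(byte[] data) {
        // Copied body plus a local patch; no private field access is needed.
        return "patched:" + getGraphicsState() + "/" + getTextMatrix()
                + "/" + data.length;
    }
}
```

Had the base method read 'graphicsState' and 'textMatrix' directly, the copied 
body in the subclass would not compile, which is the override problem 
described in (a).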

b) alternatively, in order to fix the yScale problem in a subclass all one in 
theory needs to do is override the isParagraphStart() method and multiply the 
value of 'position.getTextPosition().getHeight()' by the y-scale 
(position.getTextPosition().getYScale()).
Unfortunately, for some reason, when I defined the PositionWrapper class I made 
the 'getTextPosition()' method protected instead of public.  My bad.  So in 
order to make this work, I have to create yet another subclass 'wrapper' of 
PositionWrapper that opens up the access to the getTextPosition() method!  
Geez, I feel dumb for that one ...

We should change the access of PositionWrapper.getTextPosition() to public.
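
A rough sketch of the workaround in (b), assuming simplified stand-in classes. 
StubTextPosition, StubPositionWrapper and OpenPositionWrapper are hypothetical 
names, not the real PDFBox types, and getScaledHeight() is a helper invented 
for this illustration.  Only the method names getTextPosition(), getHeight() 
and getYScale() come from the comment above.

```java
// Hypothetical stand-in for a text position; NOT the real PDFBox class.
class StubTextPosition {
    private final float height;
    private final float yScale;

    StubTextPosition(float height, float yScale) {
        this.height = height;
        this.yScale = yScale;
    }

    float getHeight() { return height; }   // unscaled height
    float getYScale() { return yScale; }   // vertical scale factor
}

// Hypothetical stand-in for the wrapper with the too-narrow accessor.
class StubPositionWrapper {
    private final StubTextPosition position;

    StubPositionWrapper(StubTextPosition position) {
        this.position = position;
    }

    // Protected, as described in the comment: callers outside the
    // package or class hierarchy cannot reach the wrapped position.
    protected StubTextPosition getTextPosition() { return position; }
}

// The extra 'wrapper of the wrapper': widens the accessor to public.
class OpenPositionWrapper extends StubPositionWrapper {
    OpenPositionWrapper(StubTextPosition position) {
        super(position);
    }

    @Override
    public StubTextPosition getTextPosition() {
        return super.getTextPosition();
    }

    // Hypothetical helper: height multiplied by the y-scale, i.e. the
    // value an isParagraphStart() override would compare against.
    float getScaledHeight() {
        StubTextPosition tp = getTextPosition();
        return tp.getHeight() * tp.getYScale();
    }
}
```

With getTextPosition() made public in PositionWrapper itself, the extra 
subclass would be unnecessary and the scaling could be done directly inside 
the isParagraphStart() override.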



> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
