[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Tue, 23 Nov 2010 11:43:42 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935002#action_12935002
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

Andreas,

I have finally had a chance to return to this after a long time spent on other 
tasks.

I have just started trying to work with the PDFBox 1.3.1 code which includes 
this latest version of PDFTextStripper.

I don't have a big problem with your overall changes to the way the 
'normalize()' method works, though it is now less object oriented.  One of 
the reasons I structured the method the way I did, using a 
'NormalizedTextPosition' subclass of TextPosition, is that it made it easy to 
subclass the stripper for specialized extraction to specific formats 
where additional character normalization/transformation might be needed.

For example, one of the main points of the whole rewrite of the stripper was to 
enable instrumentation of the page / article / paragraph boundaries.  This 
enables me to very easily create a subclass that outputs to an XML format.

99% of that still works just fine with the PDFTextStripper that resulted from 
your merge.  The only problem is that I was relying on the ability to override 
/ enhance the normalize() method in order to assert that the characters are 
valid XML characters.
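
To illustrate the kind of check I mean, here is a minimal sketch of an XML 
character filter.  It is not PDFBox code: the class and method names are 
hypothetical, and it simply drops any code point outside the XML 1.0 'Char' 
production (a real subclass might prefer to replace such characters instead).

```java
// Sketch only: XmlCharFilter and its methods are hypothetical names,
// not part of PDFBox.
final class XmlCharFilter {

    private XmlCharFilter() { }

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow in a document. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // An unpaired surrogate comes back as its own value, which is
            // outside the valid ranges and is therefore dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

An overriding normalize() could run each word through something like 
stripInvalidXmlChars() before the text is written out.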

I get what you are saying about the single-character TextPositions and RTL.  
So a solution that does not rely on subclassing the TextPosition for 
normalization is fine.  However, I still think that the 
'PDFTextStripper.normalize(List,boolean)' method should be 'protected' and not 
'private'.

That would allow subclasses to override the normalization.

Could we make that method 'protected' in the next build?

In addition, we should also make the 'PDFTextStripper.WordSeparator' inner class 
'protected' (though still left as 'final').  Subclass visibility of that 
marker class is needed to test the stream if one DOES override the normalize() 
method.
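
A rough sketch of what the two visibility changes would allow.  These classes 
are hypothetical stand-ins, not the real PDFTextStripper: the real method 
works on TextPosition objects, and I am assuming here that the boolean is an 
RTL flag.  The point is only that the subclass can override normalize() and 
recognize the separator marker while doing so.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: SketchStripper and XmlSketchStripper are hypothetical
// stand-ins. Plain Objects replace TextPosition to keep this self-contained.
class SketchStripper {

    /** Marker placed in the stream between words; final, but subclass-visible. */
    protected static final class WordSeparator {
        private static final WordSeparator INSTANCE = new WordSeparator();
        private WordSeparator() { }
        public static WordSeparator getSeparator() { return INSTANCE; }
    }

    /** Joins items into words; 'protected' so that subclasses can override it. */
    protected List<String> normalize(List<Object> line, boolean isRtlDominant) {
        // isRtlDominant is an assumed meaning for the flag and is unused here.
        List<String> words = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (Object item : line) {
            if (item instanceof WordSeparator) {
                words.add(word.toString());
                word.setLength(0);
            } else {
                word.append(item);
            }
        }
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }
}

class XmlSketchStripper extends SketchStripper {

    @Override
    protected List<String> normalize(List<Object> line, boolean isRtlDominant) {
        List<Object> escaped = new ArrayList<Object>();
        for (Object item : line) {
            if (item instanceof WordSeparator) {
                escaped.add(item); // marker must pass through untouched
            } else {
                escaped.add(item.toString()
                        .replace("&", "&amp;")
                        .replace("<", "&lt;"));
            }
        }
        return super.normalize(escaped, isRtlDominant);
    }
}
```

With 'private' on either the method or the marker class, XmlSketchStripper 
would not compile, which is the situation I am in with 1.3.1.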

I'm trying to reduce the amount of custom code I have to wrap around using 
PDFBox, but without those two changes I don't know if I can upgrade to 1.3.1.


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
