[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Wed, 01 Dec 2010 07:59:51 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965720#action_12965720
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

Andreas,

I finally built everything (as usual, I had to do a bit of hackiness to get 
around my inability to use Maven (yet) in my current environment, but I got it 
done).

The patches you made work!  Thank you so much!

That said, per my comment above, there is some difference in behavior over my 
prior subclass-based solution built on top of v1.0.

Because the newer code uses only one character per TextPosition, the max height 
of an individual TextPosition is often smaller than the true max height for the 
line.

The default drop threshold (2.5) that we had before does not give good 
paragraph detection results.  It inserts too many paragraph separations in my 
test documents.

I got better results using a drop threshold of 3.0 and the results are 'mostly' 
acceptable.  It still messes up a lot more than I'd like.

I think a more accurate calculation would be, instead of using the max height 
of the individual character position, to check the max height field of each 
text position in the line (i.e. between line separations) and use the largest 
value in the line for the comparison with the ygap when detecting whether a 
line separation is also a paragraph separation.

I'll look into coding that idea up to see if it works better.
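
A minimal sketch of that idea, not the actual PDFTextStripper code: Glyph is a 
hypothetical stand-in for a text position (only a y coordinate and a height), 
and both the "ygap > dropThreshold * maxHeight" comparison and the choice of 
the previous line's max height are assumptions made for illustration.

```java
import java.util.List;

public class ParagraphGapSketch {

    /** Hypothetical stand-in for a text position: y coordinate and glyph height only. */
    static final class Glyph {
        final float y;
        final float height;

        Glyph(float y, float height) {
            this.y = y;
            this.height = height;
        }
    }

    /** Returns the largest glyph height found anywhere in the line. */
    static float maxHeight(List<Glyph> line) {
        float max = 0f;
        for (Glyph g : line) {
            max = Math.max(max, g.height);
        }
        return max;
    }

    /**
     * Treats a line separation as a paragraph separation when the vertical gap
     * between the two lines exceeds dropThreshold times the tallest glyph of
     * the previous line (assumed comparison, see the note above).
     */
    static boolean isParagraphBreak(List<Glyph> prevLine, List<Glyph> nextLine,
                                    float dropThreshold) {
        if (prevLine.isEmpty() || nextLine.isEmpty()) {
            return false;
        }
        float yGap = Math.abs(nextLine.get(0).y - prevLine.get(0).y);
        return yGap > dropThreshold * maxHeight(prevLine);
    }
}
```

With a line that mixes heights 5 and 10 and a gap of 20, a per-character 
comparison against the height-5 glyph (2.5 * 5 = 12.5) would report a paragraph 
break, while the line-wide comparison (2.5 * 10 = 25) would not.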

Another change is that various odd characters that used to get inserted into 
the text as 'C11' or 'C23', etc., now get inserted as non-printables.  Examples 
include the Greek 'alpha' and 'nu' characters.  I don't consider this a big 
deal for our purposes, but it might be for others.


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
