[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Fri, 26 Nov 2010 15:33:39 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936127#action_12936127
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

Andrea, again, I apologize for not being able to dive into this in detail 
until now (unfortunately, _after_ the v1.3.1 release).

Totally my fault.

Yes, those two are the most pressing.

The API access controls should be addressed, but they are secondary to the 
functional fix.

I played with restoring the application of the yScaleDisp factor in the 
maxHeight calculation.   It 'works' but since the TextPosition objects now have 
only a single character, the values are only for the one char and do not really 
capture the 'max height' for the line of text.   Technically, neither did the 
prior version, but it came pretty close.   The values are coming in at about 
half of what was calculated before.  Thus, to get similar paragraph 
detection results, instead of a drop threshold value of, say, 2.5, one needs to 
use 5.0 or so.  But that is over-simplifying.  The variance in ratio is pretty 
wide.   A line with a couple of font sizes will really screw up the detection.  
Heck, a single font with widely varying glyph heights will also screw it up.
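
To make the threshold arithmetic concrete, here is a minimal sketch. It assumes 
the drop threshold is compared against the vertical gap expressed in multiples 
of the measured height; the class name, the helper dropRatio() and all numbers 
are made up for illustration and are not PDFBox API.

```java
public class DropRatioSketch {

    // Hypothetical helper: vertical gap between two lines, expressed in
    // multiples of the height measured for the line.
    static float dropRatio(float previousY, float currentY, float maxHeight) {
        return Math.abs(currentY - previousY) / maxHeight;
    }

    public static void main(String[] args) {
        float previousY = 100f;
        float currentY = 130f;   // a 30 unit drop (illustrative value)

        float lineHeight = 12f;  // assumed height measured over a whole run
        float charHeight = 6f;   // assumed height of one character, about half

        // The same physical gap yields a ratio twice as large once the
        // measured height is halved, so the threshold has to double as well.
        System.out.println(dropRatio(previousY, currentY, lineHeight));
        System.out.println(dropRatio(previousY, currentY, charHeight));
    }
}
```

With these made-up values the two ratios differ by exactly the factor by which 
the measured height shrank, which is why a fixed threshold stops working as 
soon as heights vary within a line.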

To do this correctly, what really needs to be done is to check the maxHeight of 
each TextPosition in the whole line between line separation points and retain 
the largest value to use in the isParagraphStart() calculation.
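
A rough sketch of that idea follows. It is not the planned rewrite: Glyph is a 
hypothetical stand-in for a single-character TextPosition, and 
maxHeightOfLine() and looksLikeParagraphStart() are illustrative names, not the 
actual PDFTextStripper methods. It also assumes the paragraph test compares the 
vertical gap against the drop threshold times the line's height.

```java
import java.util.Arrays;
import java.util.List;

public class LineMaxHeightSketch {

    // Hypothetical stand-in for a TextPosition that holds one character.
    static final class Glyph {
        final float height;

        Glyph(float height) {
            this.height = height;
        }
    }

    // Scan every glyph between two line separation points and keep the
    // largest height seen.
    static float maxHeightOfLine(List<Glyph> line) {
        float max = 0f;
        for (Glyph g : line) {
            max = Math.max(max, g.height);
        }
        return max;
    }

    // Assumed form of the check: a gap larger than dropThreshold times the
    // line's max height is treated as a paragraph break.
    static boolean looksLikeParagraphStart(float yGap, float lineMaxHeight,
                                           float dropThreshold) {
        return yGap > dropThreshold * lineMaxHeight;
    }

    public static void main(String[] args) {
        // One line mixing small and large glyphs (illustrative heights).
        List<Glyph> line = Arrays.asList(new Glyph(6f), new Glyph(12f), new Glyph(6f));
        float yGap = 20f;
        float dropThreshold = 2.5f;

        // Using only the last glyph's height reports a break; using the
        // max over the whole line does not.
        System.out.println(looksLikeParagraphStart(yGap, 6f, dropThreshold));
        System.out.println(looksLikeParagraphStart(yGap, maxHeightOfLine(line), dropThreshold));
    }
}
```

In this made-up case the per-character height produces a false paragraph break 
that the per-line maximum avoids.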

Like I said, I'll try to put together a rewrite that addresses all these 
issues.   It may take me a couple of days - it's a holiday weekend here!  :-D




> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
