[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Mon, 21 Jun 2010 11:24:48 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880927#action_12880927
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

I have no strong feelings about the correct 'name' to use for this patch.

'LogicalTextStripper' isn't quite on the mark, since the patch still just uses 
geometric landmarks to derive breaks, though it does guess at certain common 
patterns used for logical structure (e.g., hanging indents).  To actually 
call it 'LogicalTextStripper' seems way too ambitious for what it does and 
might scare people away from trying it.
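
To make the landmark idea concrete, here is a minimal Java sketch of the kind 
of positional heuristic meant above.  It is NOT the code in the attached patch 
and uses no PDFBox API: the class ParagraphSketch, the Line type, the method 
splitParagraphs, and both thresholds are hypothetical names and values chosen 
for illustration only.  It assumes lines arrive in reading order with a left 
edge (x) and a baseline (y) that grows down the page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the PDFBOX-521 patch and not PDFBox API. */
final class ParagraphSketch {

    /** One extracted line: its text, left edge (x) and baseline (y, growing downward). */
    static final class Line {
        final String text;
        final float x;
        final float y;

        Line(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed thresholds; real documents would need tuning.
    static final float GAP_FACTOR = 1.5f;        // gap > 1.5 x normal leading => break
    static final float INDENT_THRESHOLD = 10f;   // horizontal shift, in page units

    /**
     * Groups lines into paragraphs using only positions.
     * A new paragraph starts on an unusually large vertical gap, or on a
     * horizontal shift that does not directly follow a paragraph's first line.
     * The second rule lets both first-line indents and hanging indents stay
     * inside one paragraph.
     */
    static List<String> splitParagraphs(List<Line> lines, float leading) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        boolean prevStartedParagraph = false;

        for (Line line : lines) {
            boolean startsNew = (prev == null);
            if (prev != null) {
                boolean bigGap = (line.y - prev.y) > leading * GAP_FACTOR;
                boolean shifted = Math.abs(line.x - prev.x) > INDENT_THRESHOLD;
                startsNew = bigGap || (shifted && !prevStartedParagraph);
            }
            if (startsNew && current.length() > 0) {
                paragraphs.add(current.toString());   // close the previous paragraph
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');                  // join wrapped lines with a space
            }
            current.append(line.text);
            prevStartedParagraph = startsNew;
            prev = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

A heuristic like this only guesses at structure from layout, which is why a 
name implying true logical analysis would oversell it.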

Technically, it can be made (through the control options) to behave like the 
current PDFTextStripper, so it *could* simply replace it.  However, its default 
behavior is different, so I don't know whether that would break anyone who 
depends on the current stripper behavior.

How about "PDFStructuredTextStripper"?  It does try to retain the 
page/article structure of the document, as well as extracting the paragraphs.

> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
