[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Mon, 26 Jul 2010 15:16:44 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892505#action_12892505 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

Ted,

I haven't had a chance to try out the merged code, nor have I tried out your 
rather interesting PDF document as an example.  In fact, it's been a while 
since I was immersed in this, so please bear with me.

If I recall, isn't the PDFTextStripper.setShouldSeparateByBeads(boolean) flag 
supposed to control whether the text is rendered by article thread versus by 
layout?

Does that not work correctly?  We always have it set to 'true', and so far all 
the documents that we are ingesting seem to be processed well enough.  I admit 
that our output requirements are not very stringent: we do not need a close 
visual approximation of the original PDF.  We mainly want the text blocks of 
each article to be threaded into the correct paragraph chunks.

I am probably misunderstanding what it is you are trying to achieve.


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
