[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Ted Dunning (JIRA) Mon, 26 Jul 2010 15:54:43 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892525#action_12892525 ]


Ted Dunning commented on PDFBOX-521:
------------------------------------

{quote}
If I recall, isn't the PDFTextStripper.setShouldSeparateByBeads(boolean) flag 
supposed to control whether the text is rendered by article thread versus by 
layout? 
{quote}
It is supposed to, but in many of these documents the text is stored in a very 
strange order inside the PDF (which breaks the non-bead version), and the text 
regions are geometrically difficult to separate, which defeats simple y-order 
sorts.  When I tried this a while ago, no option that I found on vanilla 
PDFTextStripper would provide even marginally readable text for this document 
(and many, many others that are similar).  On some pages, the columns are 
rendered out of order, footnotes appear in the middle of the text, and there 
are some very creative uses of fonts.
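
To make the y-order failure concrete, here is a minimal sketch in plain Java. 
It does not use PDFBox at all: the Fragment type, both methods and the 
boundaryX parameter are hypothetical names invented for this illustration, and 
y is assumed to grow downward from the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these types come from PDFBox.
class YOrderSketch {

    /** One line of text and the position of its top-left corner (y grows downward). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Naive reading order: top to bottom, ties broken left to right. */
    static List<String> naiveYOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return texts(sorted);
    }

    /** Column-aware order: everything left of boundaryX first, then the rest. */
    static List<String> columnAwareOrder(List<Fragment> fragments, float boundaryX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < boundaryX ? 0 : 1)
                .thenComparingDouble(f -> f.y));
        return texts(sorted);
    }

    /** Extracts the text of each fragment, preserving list order. */
    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }
}
```

On a two-column page with lines L1, L2 on the left and R1, R2 on the right at 
matching heights, naiveYOrder interleaves the columns (L1, R1, L2, R2), while 
columnAwareOrder with a boundary between the columns keeps each column 
together (L1, L2, R1, R2).  The hard part in real documents is that no single 
clean boundary exists, which is exactly where the simple sorts fall over.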

But that said, I am months out of date on trying this.  I may get back to this 
problem in a month or four and be able to give you a real answer.
{quote}
We mainly want the text blocks of each article to be correctly threaded into 
correct paragraph chunks. 

I am probably misunderstanding what it is you are trying to achieve. 
{quote}
I think that we are after something similar.  My own goals would be

a) render a plain text version of the document that reads well a page at a time 
(not just a paragraph at a time)

OR 

b) render a half-graphic version of the document with as much text extracted in 
as large a chunk as practical (word level is fine, really) with precise xy and 
font information retained.  For text that cannot quickly be seen to be 
renderable in a web browser (probably due to font or size limits) and for all 
graphical content, I would like to render everything onto a background image.  
The ultimate goal is HTML rendering of a visually accurate page image with much 
smaller storage requirements.  Presumably, with almost all of the text removed 
from the background image, the background image should be very small for 
almost all pages.
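
As a rough sketch of what the output of (b) could look like, the plain Java 
below places word-level chunks as absolutely positioned spans over a 
background image.  Everything here is an assumption made for illustration: the 
Word type, the renderPage method, the use of points as CSS units, and the 
convention that y is already measured from the top of the page.  Nothing in it 
is PDFBox API.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these types come from PDFBox.
class HalfGraphicSketch {

    /** A word-level chunk with position and font information retained. */
    static final class Word {
        final String text;
        final float x;          // left edge, in points
        final float y;          // top edge, in points, measured from the page top
        final String fontFamily;
        final float fontSizePt;

        Word(String text, float x, float y, String fontFamily, float fontSizePt) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontFamily = fontFamily;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Renders one page as a positioned div. The background image is assumed to
     * hold all graphics plus any text that was not extracted. Font names and
     * the image URL are assumed to contain no quote characters.
     */
    static String renderPage(List<Word> words, String backgroundImageUrl,
                             float widthPt, float heightPt) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt;"
                        + "background-image:url('%s')\">\n",
                widthPt, heightPt, backgroundImageUrl));
        for (Word w : words) {
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-family:'%s';font-size:%.1fpt\">%s</span>\n",
                    w.x, w.y, w.fontFamily, w.fontSizePt, escape(w.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }
}
```

Keeping one span per word is the simplest choice and matches the "word level 
is fine" remark above; merging adjacent words with the same font into larger 
spans would shrink the HTML further but needs the same layout analysis that 
goal (a) does.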

Obviously, the more that I can align with other people's goals, the more we can 
share the implementation work.


> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

