[ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892525#action_12892525 ]
Ted Dunning commented on PDFBOX-521:
------------------------------------
{quote}
If I recall, isn't the PDFTextStripper.setShouldSeparateByBeads(boolean) flag
supposed to control whether the text is rendered by article thread versus by
layout?
{quote}
It is supposed to, but many of these documents store their text in a very
strange order inside the PDF (which breaks the non-bead version) and have text
regions that are geometrically difficult to separate, which defeats simple
y-order sorts. When I tried this a while ago, no option that I found on vanilla
PDFTextStripper produced even marginally readable text for this document (or
for many, many similar ones). On some pages the columns are rendered out of
order, footnotes appear in the middle of the text, and there are some very
creative uses of fonts.
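As a rough illustration of why a plain y-order sort breaks down on multi-column
pages, here is a minimal sketch. The `Fragment` and `ReadingOrderSketch` classes
are hypothetical and are not part of PDFBox, and the column-aware variant
assumes the x position of the gutter is already known, which is exactly the
hard part on real documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical positioned piece of text; not a PDFBox type. */
final class Fragment {
    final String text;
    final double x; // left edge of the fragment
    final double y; // distance from the top of the page

    Fragment(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrderSketch {
    /** Naive order: top to bottom, then left to right. Interleaves columns. */
    static List<String> yOrder(List<Fragment> page) {
        List<Fragment> sorted = new ArrayList<>(page);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return texts(sorted);
    }

    /** Column-aware order: everything left of the gutter first, then the rest. */
    static List<String> columnOrder(List<Fragment> page, double gutterX) {
        List<Fragment> sorted = new ArrayList<>(page);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < gutterX ? 0 : 1)
                .thenComparingDouble(f -> f.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two columns, stored in a scrambled internal order.
        List<Fragment> page = List.of(
                new Fragment("R1", 300, 100),
                new Fragment("L2", 50, 112),
                new Fragment("L1", 50, 100),
                new Fragment("R2", 300, 112));
        System.out.println(yOrder(page));
        System.out.println(columnOrder(page, 200));
    }
}
```

With a single fixed gutter this works for a clean two-column page, but it does
nothing for footnotes, sidebars, or columns whose boundaries shift from page to
page.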
But that said, I am months out of date on trying this. I may get back to this
problem in a month or four and be able to give you a real answer.
{quote}
We mainly want the text blocks of each article to be correctly threaded into
correct paragraph chunks.
I am probably misunderstanding what it is you are trying to achieve.
{quote}
I think that we are after something similar. My own goals would be to
a) render a plain text version of the document that reads well a page at a time
(not just a paragraph at a time)
OR
b) render a half-graphic version of the document with the text extracted in
chunks as large as practical (word level is fine, really) and with precise xy
and font information retained. Any text that cannot quickly be shown to be
renderable in a web browser (probably due to font or size limits), and all
graphical content, I would like to render onto a background image. The ultimate
goal is an HTML rendering of a visually accurate page image with much smaller
storage requirements. Presumably, with almost all of the text removed from the
background image, the background image should be very small for almost all
pages.
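To make option (b) concrete, here is a minimal sketch of the HTML side only:
word-level text with xy and font information is emitted as absolutely
positioned spans over a background image. The `Word` and `HtmlOverlaySketch`
classes, the field names, and the choice of points as the CSS unit are all
assumptions made for illustration; none of this is PDFBox API, and producing
the background image itself is not covered.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical word-level extraction result; not a PDFBox type. */
final class Word {
    final String text;
    final double x;          // left edge, in points from the left of the page
    final double y;          // top edge, in points from the top of the page
    final String fontFamily; // assumed to be a browser-renderable family name
    final double fontSizePt;

    Word(String text, double x, double y, String fontFamily, double fontSizePt) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.fontFamily = fontFamily;
        this.fontSizePt = fontSizePt;
    }
}

final class HtmlOverlaySketch {
    /** Escapes the characters that matter inside HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Renders one page as a positioned container with one span per word.
     * The image name and font family are assumed to be trusted values.
     */
    static String renderPage(List<Word> words, String backgroundImage,
                             double widthPt, double heightPt) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt;"
                        + "background:url('%s') no-repeat\">%n",
                widthPt, heightPt, backgroundImage));
        for (Word w : words) {
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-family:'%s';font-size:%.1fpt\">%s</span>%n",
                    w.x, w.y, w.fontFamily, w.fontSizePt, escape(w.text)));
        }
        html.append("</div>");
        return html.toString();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("Improved", 72, 100.5, "Times", 10),
                new Word("extraction", 120, 100.5, "Times", 10));
        System.out.println(renderPage(words, "page1.png", 612, 792));
    }
}
```

Words whose fonts cannot be matched in the browser would simply be left out of
the list and stay in the background image.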
Obviously, the more that I can align with other people's goals, the more we can
share the implementation work.
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.