[ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936111#action_12936111 ]
Mel Martinez commented on PDFBOX-521:
-------------------------------------
I am starting to feel like a real nag, but ...
In PDFTextStripper, the static initializer block that reads the system
properties "pdftextstripper.indent" and "pdftextstripper.drop" needs to be
moved to _below_ the declarations of the two fields that it sets
(DEFAULT_INDENT_THRESHOLD and DEFAULT_DROP_THRESHOLD), but before either
gets used.
In its current location, the values it sets are overwritten by the fields'
default assignments, because Java runs static initializers and static field
initializers in the order they appear in the source.
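To illustrate the ordering problem, here is a minimal sketch. It is not the PDFBox source: the class names BrokenOrder, FixedOrder and StaticInitOrderDemo, the field name indentThreshold, and the 2.0f default are assumptions made for the demo. Only the property name "pdftextstripper.indent" is taken from PDFTextStripper.

```java
// Hypothetical demo classes, not PDFBox code. The 2.0f default is an assumed value.
class BrokenOrder {
    // Runs first. Assigning to a field declared further down compiles,
    // but the value is overwritten by the field initializer below.
    static {
        String v = System.getProperty("pdftextstripper.indent");
        if (v != null) {
            indentThreshold = Float.parseFloat(v);
        }
    }

    // Runs second and resets the field to the default.
    static float indentThreshold = 2.0f;
}

class FixedOrder {
    // Runs first and establishes the default.
    static float indentThreshold = 2.0f;

    // Runs second, so a system property, if set, takes precedence.
    static {
        String v = System.getProperty("pdftextstripper.indent");
        if (v != null) {
            indentThreshold = Float.parseFloat(v);
        }
    }
}

class StaticInitOrderDemo {
    public static void main(String[] args) {
        // Set the property before either class is initialized.
        System.setProperty("pdftextstripper.indent", "9.0");
        System.out.println("broken: " + BrokenOrder.indentThreshold);
        System.out.println("fixed:  " + FixedOrder.indentThreshold);
    }
}
```

With the property set to 9.0, BrokenOrder still reports 2.0 while FixedOrder reports 9.0, which is the behavior the proposed move is meant to restore.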
I will try to code up a rewrite of PDFStreamEngine, PositionWrapper and
PDFTextStripper and post svn diff files that roll up all these needed changes.
> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
> Key: PDFBOX-521
> URL: https://issues.apache.org/jira/browse/PDFBOX-521
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Assignee: Andreas Lehmkühler
> Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.