[jira] Updated: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Tue, 22 Sep 2009 13:54:41 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mel Martinez updated PDFBOX-521:
--------------------------------

    Attachment: pdftextstripper2.zip

Modified to re-enable the writeWordSeparator() and 
writeCharacters(TextPosition) methods, improving the instrumentation available 
to subclasses.

That is, you can now override the discrete character output methods as well 
as the various sectional boundaries.
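
As a rough illustration of the override pattern described above, here is a 
minimal, self-contained sketch.  SketchStripper is a hypothetical stand-in, 
not the PDFBox class: it borrows only the hook names writeWordSeparator() and 
writeCharacters() from this message, takes a plain String where the real 
method takes a TextPosition, and assumes paragraph start/end hooks named 
writeParagraphStart() and writeParagraphEnd() purely for the example.

```java
import java.util.List;

// Hypothetical stand-in for a text stripper with overridable output hooks.
// This is NOT the PDFBox class; it only illustrates the hook-method pattern.
class SketchStripper {
    protected final StringBuilder out = new StringBuilder();

    // Hooks a subclass may override (names assumed for this sketch).
    protected void writeParagraphStart() { }
    protected void writeParagraphEnd() { out.append('\n'); }
    protected void writeWordSeparator() { out.append(' '); }
    protected void writeCharacters(String text) { out.append(text); }

    // Drives the hooks: each inner list holds the words of one paragraph.
    String render(List<List<String>> paragraphs) {
        out.setLength(0);
        for (List<String> words : paragraphs) {
            writeParagraphStart();
            for (int i = 0; i < words.size(); i++) {
                if (i > 0) {
                    writeWordSeparator();
                }
                writeCharacters(words.get(i));
            }
            writeParagraphEnd();
        }
        return out.toString();
    }
}

// Subclass that tags paragraph boundaries with HTML-style markers.
class HtmlParagraphStripper extends SketchStripper {
    @Override
    protected void writeParagraphStart() { out.append("<p>"); }

    @Override
    protected void writeParagraphEnd() { out.append("</p>\n"); }
}
```

A subclass written this way changes only the boundary markers and inherits 
the default word and character output, which is the kind of selective 
override the change is meant to allow.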

This goes towards addressing the concerns of issue PDFBOX-533.

This also fixes a bug where non-RTL text skipped presentation normalization. 
Ligatures and special characters are now replaced with their plain-text 
equivalents (looks much better!).
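
To show what replacing ligatures with plain-text equivalents can look like, 
here is a small sketch using the JDK's java.text.Normalizer with Unicode 
compatibility normalization (NFKC).  This message does not say how the patch 
performs its normalization, so the class name LigatureSketch and the choice 
of NFKC are assumptions made for illustration only.

```java
import java.text.Normalizer;

// Illustrative only: the attached patch may normalize text differently.
class LigatureSketch {
    // NFKC replaces compatibility characters, such as the single-codepoint
    // "fi" ligature (U+FB01), with their plain-text equivalents.
    static String toPlainText(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

For example, LigatureSketch.toPlainText("\uFB01le") yields the four-character 
string "file".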

Performance seems to be virtually unchanged from the previous version, taking 
just a hair over 40s to process the 31MB 2006 PDF 1.7 reference doc.

Needs to be tested with RTL text (e.g. Hebrew).  I don't have any such 
documents to test with.  If anyone has some and would like to send me an 
example, please do.  Or test it yourself and post the results here.

> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
