[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

JIRA Mon, 26 Jul 2010 11:01:58 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892394#action_12892394
 ]


Andreas Lehmkühler commented on PDFBOX-521:
-------------------------------------------

I added Mel's patch in revisions 979379 and 979381, but I had to make some changes:

1. I merged both new classes into the old ones
2. I rearranged/simplified some of the code
3. I had to change the normalize method, as it didn't work for RTL text.
The old implementation asked every TextPosition whether the logical order had 
to be changed. hello3.pdf from our test arena consists of 3 words: "Hello محمد 
World.". There is one TextPosition for (nearly) every character. Since it isn't 
possible to change the order of a single character, we have to combine the 
characters into words. Those can be reordered, and everything works fine.
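
To illustrate the idea in point 3, here is a minimal sketch. It is not the 
committed PDFBox code: the class WordReorderSketch and its methods are 
hypothetical, plain one-character strings stand in for TextPosition objects, 
and the input is assumed to arrive in visual (left-to-right) order. Characters 
are first combined into words, and only words containing RTL characters are 
reversed.

```java
import java.util.List;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual PDFTextStripper implementation.
final class WordReorderSketch {

    // True if the character belongs to a right-to-left script (Hebrew, Arabic, ...).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Combines single-character pieces (assumed to be in visual order) into
    // words, then reverses each word that contains at least one RTL character.
    static String normalize(List<String> pieces) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (String piece : pieces) {
            if (piece.trim().isEmpty()) {
                // Whitespace ends the current word.
                flush(word, out);
                out.append(piece);
            } else {
                word.append(piece);
            }
        }
        flush(word, out);
        return out.toString();
    }

    // Appends the collected word to the output, reversed if it is RTL,
    // and clears the word buffer.
    private static void flush(StringBuilder word, StringBuilder out) {
        if (word.length() == 0) {
            return;
        }
        boolean rtl = false;
        for (int i = 0; i < word.length(); i++) {
            if (isRtl(word.charAt(i))) {
                rtl = true;
                break;
            }
        }
        out.append(rtl ? new StringBuilder(word).reverse().toString()
                       : word.toString());
        word.setLength(0);
    }
}
```

Deciding per word rather than per character is what makes the reordering 
possible at all: a single character has nothing to be swapped with. The sketch 
ignores combining marks and mixed-direction words, which a real implementation 
would have to handle.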

Any comments, further suggestions?




> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
