[jira] Commented: (PDFBOX-521) Improved PDF Text Extraction that notes paragraph boundaries

Mel Martinez (JIRA) Wed, 24 Nov 2010 16:47:44 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935597#action_12935597
 ]


Mel Martinez commented on PDFBOX-521:
-------------------------------------

One more issue:

This is minor, but the 'PDFTextStripper.writePageSeparator()' method is no 
longer used, because the new logic makes discrete 'writePageStart()' and 
'writePageEnd()' calls instead.

Similarly, the member field 'pageSeparator' is no longer needed.

Ideally, the following should be removed from PDFTextStripper:

private String pageSeparator;   
public String getPageSeparator();
public void setPageSeparator(String);
protected void writePageSeparator();

If we want to preserve the API for legacy reasons, so that existing client 
code does not break, the proper approach is as follows:

1) Mark the three methods as deprecated.
2) Change the 'getPageSeparator()' method to return the concatenation 
'getPageEnd() + getPageStart()'.
3) Change the 'setPageSeparator(String)' method to do nothing OR to just pass 
through to 'setPageEnd(String)' and 'setPageStart("")'.
4) Remove the private 'pageSeparator' member field since it is no longer 
logically correct anyway.
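
Roughly, steps 1 through 4 might look like the sketch below. This is not the 
actual PDFTextStripper source: 'PageSeparatorCompat' is a hypothetical 
stand-in class, the default values and the StringBuilder output are 
assumptions made to keep the example self-contained, and having 
'writePageSeparator()' delegate to the two new calls is only one possible 
choice (it could just as well be left as a no-op).

```java
// Hypothetical stand-in for PDFTextStripper; only the page-delimiter
// members discussed above are modeled here.
class PageSeparatorCompat {
    // Assumed defaults, chosen for illustration only.
    private String pageStart = "";
    private String pageEnd = "\n";

    // Stand-in for the stripper's real output writer.
    private final StringBuilder output = new StringBuilder();

    public String getPageStart() { return pageStart; }
    public void setPageStart(String pageStart) { this.pageStart = pageStart; }

    public String getPageEnd() { return pageEnd; }
    public void setPageEnd(String pageEnd) { this.pageEnd = pageEnd; }

    protected void writePageStart() { output.append(pageStart); }
    protected void writePageEnd() { output.append(pageEnd); }

    /** Step 2: the separator is whatever falls between two pages. */
    @Deprecated
    public String getPageSeparator() {
        return getPageEnd() + getPageStart();
    }

    /** Step 3: pass through to the new setters (the pass-through variant). */
    @Deprecated
    public void setPageSeparator(String separator) {
        setPageEnd(separator);
        setPageStart("");
    }

    /** Assumption: delegate to the new calls instead of doing nothing. */
    @Deprecated
    protected void writePageSeparator() {
        writePageEnd();
        writePageStart();
    }

    // Step 4: there is no 'pageSeparator' field anymore.

    public String getOutput() { return output.toString(); }

    public static void main(String[] args) {
        PageSeparatorCompat stripper = new PageSeparatorCompat();
        stripper.setPageSeparator("\f");  // legacy client call
        System.out.println("\f".equals(stripper.getPageSeparator())); // prints true
    }
}
```

With this arrangement, legacy code that only ever calls 
'setPageSeparator(String)' and 'getPageSeparator()' reads back the value it 
set, while new code can set the page start and end independently.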

Again, I apologize for not getting a chance to really look at this stuff prior 
to the v1.3.1 release.



> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is 
> to ignore paragraph demarcation in the text.  It basically just renders each 
> line of text as it discovers it, separating each line equally with the same 
> line separator.
> This makes it difficult to identify paragraph (or even page) starts and stops 
> in the extracted text.  This is often necessary for text processing that 
> needs to work with logical 'chunks' of text.  Further, rendering into other 
> formats (such as HTML or XML) is facilitated by resolving the document into 
> more discrete logical text chunks.
> The request here is for improved text extraction that provides more discrete 
> instrumentation of the parsing, allowing one to identify / tag paragraph 
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
