[ https://issues.apache.org/jira/browse/PDFBOX-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964602#action_12964602 ]
Andreas Lehmkühler edited comment on PDFBOX-521 at 11/28/10 5:46 PM:
---------------------------------------------------------------------

I fixed 2 minor issues in revisions 1039871 (compile error introduced in revision 1039739) and 1039964 (typo: "/" instead of "*" led to a wrong calculation)

      was (Author: lehmi):
    I fixed 2 minor issues in revisions 1039871 and 1039964

> Improved PDF Text Extraction that notes paragraph boundaries
> ------------------------------------------------------------
>
>                 Key: PDFBOX-521
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-521
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>            Assignee: Andreas Lehmkühler
>         Attachments: pdftextstripper2.zip
>
>
> The current behavior of the org.apache.pdfbox.util.PDFTextStripper class is
> to ignore paragraph demarcation in the text. It basically just renders each
> line of text as it discovers it, separating each line equally with the same
> line separator.
>
> This makes it difficult to identify paragraph (or even page) starts and stops
> in the extracted text. This is often necessary for text processing that
> needs to work with logical 'chunks' of text. Further, rendering into other
> formats (such as HTML or XML) is facilitated by resolving the document into
> more discrete logical text chunks.
>
> The request here is for improved text extraction that provides more discrete
> instrumentation of the parsing, allowing one to identify / tag paragraph
> starts and stops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.