[ https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040999#comment-14040999 ]
Tim Allison commented on PDFBOX-1130:
-------------------------------------
User error. Subclasses of PDFTextStripper that override writeParagraphStart()
and writeParagraphEnd() must remember to call super.writeParagraphStart() and
super.writeParagraphEnd() to get the correct number of starts and ends.
My fault. No need to open a new ticket. Thank you.
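
For anyone who hits the same thing, here is a rough sketch of the pattern.
It does not use PDFBox itself: SketchStripper is a hypothetical stand-in
that only assumes the base class tracks whether a paragraph is currently
open, and BrokenStripper and CountingStripper are made-up subclasses. The
real PDFTextStripper may differ in its details.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in for the base stripper (NOT the PDFBox class).
// Assumption: the base class remembers whether a paragraph is open, so
// that every start it writes is matched by exactly one end.
class SketchStripper {
    protected final Writer output;
    private boolean inParagraph = false;

    SketchStripper(Writer output) {
        this.output = output;
    }

    protected void writeParagraphStart() throws IOException {
        if (inParagraph) {
            writeParagraphEnd(); // close the previous paragraph first
        }
        output.write("<p>");
        inParagraph = true;
    }

    protected void writeParagraphEnd() throws IOException {
        if (!inParagraph) {
            return; // nothing is open, so nothing to close
        }
        output.write("</p>");
        inParagraph = false;
    }

    // Simulated extraction: each paragraph triggers a start, and only the
    // last one is followed by an explicit end.
    String run(String... paragraphs) throws IOException {
        for (String text : paragraphs) {
            writeParagraphStart();
            output.write(text);
        }
        writeParagraphEnd();
        return output.toString();
    }
}

// Broken: writes its own tags and never calls super, so the base class
// never learns that a paragraph is open and cannot close it.
class BrokenStripper extends SketchStripper {
    BrokenStripper(Writer output) {
        super(output);
    }

    @Override
    protected void writeParagraphStart() throws IOException {
        output.write("<p>");
    }

    @Override
    protected void writeParagraphEnd() throws IOException {
        output.write("</p>");
    }
}

// Correct: adds its own bookkeeping but delegates the tags to super.
class CountingStripper extends SketchStripper {
    int starts = 0;
    int ends = 0;

    CountingStripper(Writer output) {
        super(output);
    }

    @Override
    protected void writeParagraphStart() throws IOException {
        super.writeParagraphStart();
        starts++;
    }

    @Override
    protected void writeParagraphEnd() throws IOException {
        super.writeParagraphEnd();
        ends++;
    }
}
```

In this sketch, the broken subclass leaves the first <p> unclosed, which is
the same symptom as in the issue description, while the subclass that
delegates to super gets matching starts and ends.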
> ExtractText -html doesn't always close the <p> tags it opens
> ------------------------------------------------------------
>
> Key: PDFBOX-1130
> URL: https://issues.apache.org/jira/browse/PDFBOX-1130
> Project: PDFBox
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 1.8.0
>
> Attachments: 000086.pdf, PDFBOX-1130.patch
>
>
> I have a test document (the same one as on PDFBOX-1129). When run through
> ExtractText -html, it extracts the page number for each page, but in each
> case the page number looks like:
> <p>N<p>Text of page N...
> I.e., the <p> tag for the page number wasn't closed.
> Maybe related: if I run ExtractText without -html, there is no space after
> the page number and before the next word, i.e. I see words like
> 1Massachusetts, 2Course, 3also, 4the.