[jira] [Commented] (PDFBOX-1130) ExtractText -html doesn't always close the
<p> tags it opens

Tim Allison (JIRA) Mon, 23 Jun 2014 10:29:40 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040999#comment-14040999
 ]


Tim Allison commented on PDFBOX-1130:
-------------------------------------

User error.  Subclasses of PDFTextStripper that override writeParagraphStart() 
and writeParagraphEnd() must remember to call super.writeParagraphStart() and 
super.writeParagraphEnd() so that the number of paragraph starts and ends stays 
balanced.  My fault.  No need to open a new ticket.  Thank you.
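
For anyone hitting the same thing, here is a minimal, self-contained sketch of 
the failure mode.  It does not use PDFBox itself: SketchStripper, 
ForgetfulStripper and DelegatingStripper are hypothetical stand-ins, and it 
assumes the base class keeps an "in paragraph" flag that is only updated inside 
its own writeParagraphStart()/writeParagraphEnd().

```java
// Hypothetical stand-in for the stripper base class (NOT the PDFBox API).
// Assumption: the base class tracks whether a paragraph is currently open.
class SketchStripper {
    protected final StringBuilder output = new StringBuilder();
    private boolean inParagraph = false;

    protected void writeParagraphStart() {
        if (inParagraph) {
            writeParagraphEnd();   // close a dangling paragraph first
        }
        output.append("<p>");
        inParagraph = true;
    }

    protected void writeParagraphEnd() {
        if (inParagraph) {
            output.append("</p>");
            inParagraph = false;
        }
    }

    // Simulates a page number followed by page text, where the first
    // paragraph is never explicitly ended by the caller.
    String run() {
        writeParagraphStart();
        output.append("1");
        writeParagraphStart();
        output.append("Text of page 1");
        writeParagraphEnd();
        return output.toString();
    }
}

// Overrides both hooks without calling super: the flag is never updated,
// so the base class cannot close the dangling paragraph.
class ForgetfulStripper extends SketchStripper {
    @Override
    protected void writeParagraphStart() {
        output.append("<p>");
    }

    @Override
    protected void writeParagraphEnd() {
        output.append("</p>");
    }
}

// Overrides both hooks but delegates to super, adding its own bookkeeping.
class DelegatingStripper extends SketchStripper {
    int paragraphsStarted = 0;

    @Override
    protected void writeParagraphStart() {
        super.writeParagraphStart();
        paragraphsStarted++;
    }

    @Override
    protected void writeParagraphEnd() {
        super.writeParagraphEnd();
    }
}

class ParagraphDemo {
    public static void main(String[] args) {
        // prints <p>1<p>Text of page 1</p>
        System.out.println(new ForgetfulStripper().run());
        // prints <p>1</p><p>Text of page 1</p>
        System.out.println(new DelegatingStripper().run());
    }
}
```

The first line of output shows the same unbalanced <p>N<p>Text pattern 
described in the issue; the second shows the tags balanced once the overrides 
delegate to super.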

> ExtractText -html doesn't always close the <p> tags it opens
> ------------------------------------------------------------
>
>                 Key: PDFBOX-1130
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1130
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 1.8.0
>
>         Attachments: 000086.pdf, PDFBOX-1130.patch
>
>
> I have a test document (same one on PDFBOX-1129), which when run through 
> ExtractText -html, extracts the page number for each page, however in each 
> case the page number looks like:
>     <p>N<p>Text of page N...
> I.e., the <p> tag for the page number wasn't closed.
> Maybe related: if I run ExtractText without -html, there is no space after 
> the page number and before the next word, i.e. I see words like 
> 1Massachusetts, 2Course, 3also, 4the.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
