[ 
https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541423#comment-13541423
 ] 

Aaptha commented on PDFBOX-1213:
--------------------------------

Can someone post pseudocode for this? It is quite difficult to understand how 
to retain the positions in Normalize(). It would be great if someone could 
put together a patch for this. I have been struggling with it for a long 
time, but I cannot get it to work.

PDFBOX-213 is a duplicate. The solution proposed there is old and does not 
fit what we have in the trunk.

The diff attached here to PDFBOX-1213 adds the style tags line by line; 
instead, it should add them to the HTML word by word. However, 
PDFTextStripper does not allow that kind of access to the stripped text.
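
As a rough illustration of the word-by-word tagging meant here, a minimal 
Java sketch follows. It does not use any PDFBox API: StyledWord and toHtml 
are hypothetical names, and it assumes the words and their bold/italic/size 
attributes have already been extracted somehow, which is exactly the part 
that is hard to get out of PDFTextStripper today.

```java
import java.util.List;

public class WordStyleHtml {

    /** Hypothetical holder for one extracted word and its font style. */
    public static final class StyledWord {
        final String text;
        final boolean bold;
        final boolean italic;
        final float fontSize;

        public StyledWord(String text, boolean bold, boolean italic, float fontSize) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
            this.fontSize = fontSize;
        }

        /** True when both words can share the same pair of style tags. */
        boolean sameStyle(StyledWord other) {
            return bold == other.bold
                    && italic == other.italic
                    && Float.compare(fontSize, other.fontSize) == 0;
        }
    }

    /** Escapes the characters that are significant in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static void open(StringBuilder out, StyledWord w) {
        out.append("<span style=\"font-size:").append(w.fontSize).append("pt\">");
        if (w.bold) out.append("<b>");
        if (w.italic) out.append("<i>");
    }

    private static void close(StringBuilder out, StyledWord w) {
        // Close in reverse order of opening to keep the tags nested.
        if (w.italic) out.append("</i>");
        if (w.bold) out.append("</b>");
        out.append("</span>");
    }

    /**
     * Emits the words separated by single spaces, opening a new set of
     * style tags only when the style changes from one word to the next.
     */
    public static String toHtml(List<StyledWord> words) {
        StringBuilder out = new StringBuilder();
        StyledWord prev = null;
        for (StyledWord w : words) {
            if (prev == null) {
                open(out, w);
            } else if (!prev.sameStyle(w)) {
                close(out, prev);
                out.append(' ');
                open(out, w);
            } else {
                out.append(' ');
            }
            out.append(escape(w.text));
            prev = w;
        }
        if (prev != null) {
            close(out, prev);
        }
        return out.toString();
    }
}
```

Consecutive words with the same style share one set of tags, so a bold 
phrase in the middle of a line gets tagged on its own instead of the whole 
line being tagged at once.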

Please help everyone interested in this feature by providing a patch!
                
> Adding style information to the PDF to HTML converter
> -----------------------------------------------------
>
>                 Key: PDFBOX-1213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1213
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Enrique Pérez
>         Attachments: diff.patch
>
>
> This patch modifies the PDF to HTML conversion in order to add style 
> information (bold, italic and font size) to the resulting file. Moreover, we 
> have deleted the "DOCTYPE" header because some parsers throw the following 
> exception:
> [Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" 
> must end with '>'.
> org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version" 
> must end with '>'.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
