Cheng Leong created PDFBOX-1860:
-----------------------------------

             Summary: HTML converter escapes formatting close tags
                 Key: PDFBOX-1860
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1860
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3
            Reporter: Cheng Leong
            Priority: Minor
         Attachments: pdftest.pdf

This bug was introduced in 1.8.3 by PDFBOX-1213, which added HTML style information. Bold style tags are opened correctly, but the closing tags are HTML-escaped.

{noformat}
~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>1725.PDF</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
</p>
<p><b>A VERY SMALL PDF FILE </b></p>
<p><b>A VERY SMALL PDF FILE </b></p>
<p><b>A VERY SMALL PDF FILE </b></p>
<p><b>A VERY SMALL PDF FILE </b></p>
<p><b>A VERY SMALL PDF FILE </b></p>
<p><b>A VERY SMALL PDF FILE</b></p>
</div></div>
</body></html>
{noformat}
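
To illustrate the kind of mistake described above, here is a minimal, self-contained sketch. It does not reproduce the actual PDFBox code: the class `EscapeSketch` and the methods `escape`, `boldBuggy`, and `boldFixed` are hypothetical names chosen for this example. It assumes the problem arises because the closing tag passes through the same escaping step as the extracted text, which may not be exactly how the real converter goes wrong.

```java
final class EscapeSketch {

    // Escapes the characters that are special in HTML text content.
    // Hypothetical helper, not the PDFBox implementation.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assumed faulty pattern: the closing tag is appended to the text
    // before escaping, so it is escaped along with the content.
    static String boldBuggy(String text) {
        return "<b>" + escape(text + "</b>");
    }

    // Intended pattern: only the extracted text is escaped;
    // markup generated by the converter is written verbatim.
    static String boldFixed(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static void main(String[] args) {
        System.out.println(boldBuggy("A VERY SMALL PDF FILE"));
        System.out.println(boldFixed("A VERY SMALL PDF FILE"));
    }
}
```

The point of the sketch is the separation of concerns: text taken from the PDF must always be escaped, while tags emitted by the converter itself must bypass the escaping step.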