Funbit created TIKA-2897:
----------------------------

Summary: Invalid XHTML output for some OpenOffice files (created in LibreOffice Impress)
Key: TIKA-2897
URL: https://issues.apache.org/jira/browse/TIKA-2897
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.21
Environment: Command line to reproduce:
    java -jar tika-app.jar --xml Impress.odp
Reporter: Funbit
Attachments: Impress.odp

The XHTML output produced by Tika 1.21 is invalid for some LibreOffice documents. A sample document (created in LibreOffice 6.1.5) is attached.

Here is the sample output. The <p class="notes"> tag is never closed, and it is followed by a </notes> closing tag that has no matching opening tag, so a strict XHTML parser will fail on it. The second <p class="notes"> below, near the end of the output, is the one highlighted in the original report:

    <p class="notes"><div/>
    </notes><div><p>SECOND PAGE</p>
    </div>
    <div><ul> <li><p>Text on the second page</p>
    </li>
    </ul>
    </div>
    <p class="notes"><div/>
    </notes></body></html>

Thanks!
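To illustrate the "fails to parse" claim, here is a minimal sketch that checks a fragment for XML well-formedness with Python's standard library. The helper name is_well_formed and both sample strings are assumptions made for illustration: the broken string is modelled on the tail of the output above, and the well-formed string is only a comparison case, not a statement of what Tika should emit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Modelled on the reported output: <p class="notes"> is never closed and
# </notes> has no matching opening tag (illustrative fragment, not the full file).
broken = '<body><p class="notes"><div/></notes></body>'

# A well-formed variant used only for comparison (hypothetical, not Tika output).
well_formed = '<body><p class="notes"></p><div/></body>'

if __name__ == "__main__":
    print("broken:", is_well_formed(broken))
    print("well_formed:", is_well_formed(well_formed))
```

The same check could be run on the real output by saving the result of the command line above to a file and passing its contents to the helper.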