[jira] [Commented] (TIKA-2168) Incorrect <a> and <p> parsing in PdfParser

Tim Allison (JIRA) Mon, 07 Nov 2016 06:35:21 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644339#comment-15644339
 ]


Tim Allison commented on TIKA-2168:
-----------------------------------

Where are you seeing problems with <p/>? I recognize this is bad HTML, but IE, 
Chrome, Excel, and a few others seem to be OK with: 
{noformat}
<html>
<body>
the
<p> quick </p>
brown
<p/>
fox

</body>
</html>
{noformat}

> Incorrect <a> and <p> parsing in PdfParser
> ------------------------------------------
>
>                 Key: TIKA-2168
>                 URL: https://issues.apache.org/jira/browse/TIKA-2168
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.13
>         Environment: Running Tika server 1.13 and testing via http api 
>            Reporter: Sara Miller
>
> PdfParser returns self-closing tags for {code:xml}<a/>{code} and 
> {code:xml}<p/>{code}, which are not valid HTML and do not render 
> correctly in any browser.
> {code:xml}<a href="https://wiki.apache.org/tika/TikaJAXRS"/>{code} in the 
> example below should be 
> {code:xml}<a href="https://wiki.apache.org/tika/TikaJAXRS"></a>{code}
> We have tested PDFs converted from both Word and Google documents, with the 
> same results. This is an example of the output we get when parsing a 
> PDF document with a link:
>  
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
>     <head>
>         <meta name="date" content="2016-11-07T07:51:14Z"/>
>         <meta name="pdf:PDFVersion" content="1.5"/>
>         <meta name="xmp:CreatorTool" content="Microsoft® Word 2016"/>
>         <meta name="access_permission:modify_annotations" content="true"/>
>         <meta name="access_permission:can_print_degraded" content="true"/>
>         <meta name="dcterms:created" content="2016-11-07T07:51:14Z"/>
>         <meta name="Last-Modified" content="2016-11-07T07:51:14Z"/>
>         <meta name="dcterms:modified" content="2016-11-07T07:51:14Z"/>
>         <meta name="dc:format" content="application/pdf; version=1.5"/>
>         <meta name="xmpMM:DocumentID" content="uuid:7C86A62C-A4B2-464A-AAEC-5524E170E2AF"/>
>         <meta name="Last-Save-Date" content="2016-11-07T07:51:14Z"/>
>         <meta name="access_permission:fill_in_form" content="true"/>
>         <meta name="meta:save-date" content="2016-11-07T07:51:14Z"/>
>         <meta name="pdf:encrypted" content="false"/>
>         <meta name="modified" content="2016-11-07T07:51:14Z"/>
>         <meta name="Content-Type" content="application/pdf"/>
>         <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>         <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
>         <meta name="meta:creation-date" content="2016-11-07T07:51:14Z"/>
>         <meta name="created" content="Mon Nov 07 07:51:14 UTC 2016"/>
>         <meta name="access_permission:extract_for_accessibility" content="true"/>
>         <meta name="access_permission:assemble_document" content="true"/>
>         <meta name="xmpTPg:NPages" content="1"/>
>         <meta name="Creation-Date" content="2016-11-07T07:51:14Z"/>
>         <meta name="access_permission:extract_content" content="true"/>
>         <meta name="access_permission:can_print" content="true"/>
>         <meta name="producer" content="Microsoft® Word 2016"/>
>         <meta name="access_permission:can_modify" content="true"/>
>         <title></title>
>     </head>
>     <body>
>         <div class="page">
>             <p/>
>             <p>This is a word document, converted to pdf.  
> </p>
>             <p>Example link: https://wiki.apache.org/tika/TikaJAXRS 
> </p>
>             <p> </p>
>             <p/>
>             <div class="annotation">
>                 <a href="https://wiki.apache.org/tika/TikaJAXRS"/>
>             </div>
>         </div>
>     </body>
> </html>
> {code}
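
As a consumer-side stopgap, the self-closing tags in the output quoted above could be expanded before the markup is handed to a browser. The following is only a minimal sketch in Python, assuming the XHTML returned by the server is available as a string. The names `expand_self_closing` and `VOID_ELEMENTS` are hypothetical and not part of Tika, and the regex is a simplification that does not handle attribute values containing `<`, `>`, or `/>`.

```python
import re

# HTML void elements: these never take an end tag, so a trailing "/>" is
# harmless for them and they are left untouched.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Matches a self-closing tag such as <p/> or <a href="..."/>.
# Group 1 is the element name, group 2 the raw attribute text (may be empty).
_SELF_CLOSING = re.compile(r"<([A-Za-z][\w:.-]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closing(xhtml: str) -> str:
    """Rewrite <x .../> as <x ...></x> for every non-void element."""

    def _expand(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # keep <meta .../>, <br/>, etc. as they are
        return f"<{name}{attrs}></{name}>"

    return _SELF_CLOSING.sub(_expand, xhtml)


if __name__ == "__main__":
    sample = '<div class="annotation"><a href="https://wiki.apache.org/tika/TikaJAXRS"/></div><p/>'
    print(expand_self_closing(sample))
```

The void-element check matters here because the `<meta .../>` tags in the head are fine as they stand; only elements such as `<a/>` and `<p/>`, which an HTML parser treats as unclosed start tags, are rewritten.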



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
