Martin Honnen created TIKA-3972:
-----------------------------------

             Summary: Parsing RTF sample with hyperlink and ToXMLContentHandler 
returns malformed XHTML from toString method call
                 Key: TIKA-3972
                 URL: https://issues.apache.org/jira/browse/TIKA-3972
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.7.0
         Environment: Tested with Java 8 (Eclipse Temurin) and Tika 2.7.0 on 
Windows 11.
            Reporter: Martin Honnen
         Attachments: hyperlink.rtf

I am exploring Tika for RTF to X(HT)ML conversion and have run into a problem 
with an RTF document that contains a hyperlink. When I parse it with a 
ContentHandler created as a ToXMLContentHandler and call the toString() 
method on the handler, the result is a malformed X(HT)ML document in which 
the opening `<a>` tag is never closed.

I have attached the relevant RTF sample document. The output I get is

```

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
<meta name="Content-Type" content="application/rtf" />
<title></title>
</head>
<body><p />
<p />
<p>    10”Flour Tortilla</p>
<p>    Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, 
Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
<p><b />    Ripped Romaine</p>
<p>    Blackened Salmon julienne</p>
<p>    Shaved Red Onion</p>
<p>    Julienne Tomato</p>
<p>    Grated Parmesan</p>
<p>    Blackening spice: <a href="..\\..\\SPICE\\Blackening 
Spice.doc">Blackening Spice.doc</a></p>
<p />
<p>Method</p>
<p>Procedure Text </p>
<p />
<p />
</body></html>

```

The fragment `<p>    Caesar <b><i>DIP</i>: <a 
href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>` 
is flawed: the `<a href>` element is never closed, and the `</b>` end tags 
that follow it no longer match their start tags.
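
As a quick check that does not depend on Tika, a fragment like this can be run 
through the JDK's own SAX parser. The sketch below is not part of the original 
reproduction: the class name `WellFormedCheck`, the method `firstError`, and 
the shortened `href` values are illustrative assumptions, and the exact error 
text depends on the JDK's XML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not taken from Tika or from the original report.
class WellFormedCheck {

    // Paragraph shaped like the broken one in the output above
    // (href shortened to "x" for readability).
    static final String BAD =
        "<p>Caesar <b><i>DIP</i>: <a href=\"x\">Dip, Caesar.doc</b><b /></b></p>";

    // Paragraph shaped like the correctly closed link further down in the output.
    static final String GOOD =
        "<p>Blackening spice: <a href=\"x\">Blackening Spice.doc</a></p>";

    /** Returns null if the XML is well-formed, otherwise the parser's message. */
    static String firstError(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, so malformed input
            // surfaces as a SAXException instead of being printed and ignored.
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bad:  " + firstError(BAD));
        System.out.println("good: " + firstError(GOOD));
    }
}
```

For the first fragment `firstError` should return a parser message rather than 
`null`, while the second fragment parses cleanly. The same check could be 
applied to the full string returned by the handler's toString() method.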


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
