[
https://issues.apache.org/jira/browse/TIKA-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689337#comment-17689337
]
Tim Allison edited comment on TIKA-3972 at 2/15/23 8:50 PM:
------------------------------------------------------------
-It looks like our parser requires {{fldrslt}} as a hint that the field has
come to an end. {{Dip, Caesar.doc}} doesn't have that, but {{Blackening
Spice}} does.-
Sorry. That was wrong. Still looking.
Thank you for opening this issue and providing an example file!
was (Author: [email protected]):
It looks like our parser requires {{fldrslt}} as a hint that the field has come
to an end. {{Dip, Caesar.doc}} doesn't have that, but {{Blackening Spice}}
does.
Will take a bit to figure out how best to fix this. Thank you for opening this
issue and providing an example file!
> Parsing RTF sample with hyperlink and ToXMLContentHandler returns malformed
> XHTML from toString method call
> -----------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3972
> URL: https://issues.apache.org/jira/browse/TIKA-3972
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.7.0
> Environment: Tested with Java 8 (Eclipse Temurin) and Tika 2.7.0 on
> Windows 11.
> Reporter: Martin Honnen
> Priority: Major
> Labels: RTFParser, rtf
> Attachments: hyperlink.rtf
>
>
> I am exploring Tika for RTF to X(HT)ML parsing and have run into a problem
> with an RTF document that contains a hyperlink. Parsing it with a
> ContentHandler created with ToXMLContentHandler and then calling the
> toString() method on the handler returns a malformed X(HT)ML document in
> which the opening `<a>` tag is never closed.
> I have attached the relevant RTF sample document. The output I get is
> ```
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"
> />
> <meta name="X-TIKA:Parsed-By"
> content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
> <meta name="Content-Type" content="application/rtf" />
> <title></title>
> </head>
> <body><p />
> <p />
> <p> 10”Flour Tortilla</p>
> <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip,
> Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
> <p><b /> Ripped Romaine</p>
> <p> Blackened Salmon julienne</p>
> <p> Shaved Red Onion</p>
> <p> Julienne Tomato</p>
> <p> Grated Parmesan</p>
> <p> Blackening spice: <a href="..\\..\\SPICE\\Blackening
> Spice.doc">Blackening Spice.doc</a></p>
> <p />
> <p>Method</p>
> <p>Procedure Text </p>
> <p />
> <p />
> </body></html>
> ```
> where the part `<p> Caesar <b><i>DIP</i>: <a
> href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b
> /></b></p>` is flawed as the `<a href>` is not closed.
>
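The malformed output described in the report can be detected mechanically by feeding the serialized string to a standard XML parser. The following is a minimal sketch using only the JDK, not Tika itself; the class name `WellFormedCheck`, the method `isWellFormed`, and the shortened `href` values are hypothetical and chosen for illustration. In a real reproduction, the string would presumably come from `ToXMLContentHandler.toString()`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXParseException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shaped like the flawed paragraph: </b> arrives while <a> is still open.
        String broken = "<p> Caesar <b><i>DIP</i>: <a href=\"Dip.doc\">Dip, Caesar.doc</b><b /></b></p>";
        // Shaped like the correct paragraph: <a> is closed before </p>.
        String fine = "<p> Blackening spice: <a href=\"Spice.doc\">Blackening Spice.doc</a></p>";
        System.out.println("broken well-formed? " + isWellFormed(broken));
        System.out.println("fine well-formed? " + isWellFormed(fine));
    }
}
```

A check along these lines could serve as a regression assertion once the parser is fixed, since it fails on any unclosed element rather than only on this specific `<a>` case.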