Hi,

 

I noticed that the /unpack and /rmeta endpoints of Tika Server v1.27 treat
an XHTML document differently (with auto-detection enabled).

 

My input is an XHTML document containing an HTML-escaped ‘&’ (that is,
‘&amp;’). The text returned by the /unpack endpoint contains the unescaped
‘&’, whereas the text returned by the /rmeta endpoint still contains the
escaped form ‘&amp;’.

 

Is this the expected behavior?

 

It can be reproduced easily with a simple test.html file containing:

 

<html>
        <body>
                Parse &amp; extract
        </body>
</html>
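
For anyone who wants to script the comparison, here is a minimal Python sketch. It rests on several assumptions: the server address http://localhost:9998 is a guess at a default local setup, the helper names (put, ampersand_form, rmeta_text) are made up for illustration, and the /rmeta response is assumed to be a JSON list whose entries carry the extracted text under "X-TIKA:content". Nothing is sent to the server until put() is called.

```python
import json
import urllib.request

# Assumption: a Tika Server instance listening on the usual local port.
TIKA_URL = "http://localhost:9998"

# Same content as the test.html file above.
SAMPLE = b"<html>\n<body>\nParse &amp; extract\n</body>\n</html>\n"


def put(endpoint: str, body: bytes, accept: str) -> bytes:
    """PUT the document to a Tika Server endpoint and return the raw response."""
    req = urllib.request.Request(
        TIKA_URL + endpoint,
        data=body,
        method="PUT",
        headers={"Accept": accept},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def rmeta_text(raw: bytes) -> str:
    """Join the "X-TIKA:content" fields of an /rmeta JSON response.

    Assumption: the response is a JSON list of metadata objects.
    """
    docs = json.loads(raw)
    return "\n".join(doc.get("X-TIKA:content") or "" for doc in docs)


def ampersand_form(text: str) -> str:
    """Report whether the text holds '&amp;', a bare '&', or no ampersand."""
    if "&amp;" in text:
        return "escaped"
    if "&" in text:
        return "unescaped"
    return "absent"
```

With a server running, ampersand_form(rmeta_text(put("/rmeta", SAMPLE, "application/json"))) classifies the /rmeta output; the same check can be applied to the text obtained from the /unpack response once it has been extracted.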

 

Regards,

 

Julien Massiera

Product Manager

France Labs – Makers of Datafari Enterprise Search
<https://www.datafari.com/en>
Join us at Open Source eXPerience 2021
<https://www.opensource-experience.com/> on 9 and 10 November

 
