Behavior unpack vs rmeta endpoints

julien.massiera Fri, 19 Nov 2021 02:32:16 -0800

Hi,


I noticed that the /unpack and /rmeta endpoints of Tika Server v1.27 (with
auto-detection) handle an XHTML document differently.

 

My input is an XHTML document containing an HTML-escaped ampersand (&amp;).
The /unpack endpoint returns text with the ampersand unescaped (&), whereas
the /rmeta endpoint returns text that still contains the escaped form
(&amp;).

 

Is this the expected behavior?

 

This is easy to reproduce with a simple test.html file containing:

 

<html>
        <body>
                Parse &amp; extract
        </body>
</html>
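
For anyone who wants to reproduce the comparison, here is a minimal Python sketch. It rests on several assumptions that are not stated above: that Tika Server listens on http://localhost:9998, that the text is read from the __TEXT__ entry of the zip returned by /unpack/all, and that the /rmeta text is taken from the X-TIKA:content field of the first JSON object. The helper names are purely illustrative and are not part of Tika.

```python
import io
import json
import urllib.request
import zipfile

TIKA_URL = "http://localhost:9998"  # assumption: local Tika Server on its default port


def build_request(endpoint, body):
    """Build a PUT request for an endpoint such as 'rmeta' or 'unpack/all'."""
    accept = "application/json" if endpoint.startswith("rmeta") else "application/zip"
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}",
        data=body,
        method="PUT",
        headers={"Accept": accept},
    )


def rmeta_content(payload):
    """Return X-TIKA:content of the first document in the /rmeta JSON list."""
    docs = json.loads(payload)
    return docs[0].get("X-TIKA:content", "")


def unpack_text(payload):
    """Return the __TEXT__ entry of the zip archive returned by /unpack/all."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return archive.read("__TEXT__").decode("utf-8")


def fetch(endpoint, body):
    """Send the document to the server and return the raw response body."""
    with urllib.request.urlopen(build_request(endpoint, body)) as resp:
        return resp.read()


if __name__ == "__main__":
    with open("test.html", "rb") as fh:
        html = fh.read()
    # Print both results with repr() so escaped and unescaped forms are easy to compare.
    print("rmeta :", repr(rmeta_content(fetch("rmeta", html))))
    print("unpack:", repr(unpack_text(fetch("unpack/all", html))))
```

Running the script next to test.html prints both extracted texts side by side, so any difference in how the ampersand is rendered is visible directly.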

 

Regards,

 

Julien Massiera

Product Manager

France Labs - Makers of Datafari Enterprise Search <https://www.datafari.com/en>
Meet us at Open Source eXPerience 2021 <https://www.opensource-experience.com/>
on November 9 and 10
