Hi Julien,
Sorry for the delay. The following results are with Tika 1.27.
When I run /rmeta, I get xhtml that correctly escapes the & (I think?),
because the default /rmeta content format is xhtml:
<html xmlns:"...><body>Parse \u0026amp; extract</body></html>
When I run /rmeta/text, I get the text with the & correctly converted:
\n\n\n\n\n\n\n\n\n\n Parse \u0026 extract\n\n
When I run /unpack/all, I get the text with the & correctly converted:
Parse & extract
So, I think the difference you are seeing is that the default handler in
/rmeta is the xhtml content handler, whereas it is the ToTextHandler in
/unpack.
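As a rough illustration of the above, here is a minimal Python sketch (not
Tika code). The two string literals are hypothetical stand-ins shaped like
the JSON values shown above, and the variable names are mine. Decoding the
JSON turns \u0026 back into &, after which the xhtml content still carries
the XML entity and the text content does not.

```python
import html
import json

# Hypothetical stand-ins for the JSON string values returned by /rmeta (xhtml)
# and /rmeta/text; in JSON, \u0026 is simply an escaped '&'.
rmeta_xhtml_json = '"<body>Parse \\u0026amp; extract</body>"'
rmeta_text_json = '"Parse \\u0026 extract"'

# Undo the JSON escaping first.
xhtml_content = json.loads(rmeta_xhtml_json)
text_content = json.loads(rmeta_text_json)

print(xhtml_content)  # the '&' is still escaped as an XML entity
print(text_content)   # plain text, no entity

# Strip the wrapping <body> tags, then unescape the entity. The result should
# match what the text handler produces.
inner = xhtml_content[len("<body>"):-len("</body>")]
unescaped = html.unescape(inner)
print(unescaped == text_content)
```

So the escaped form in the /rmeta output is what a serialized xhtml fragment
is expected to look like, and a client that wants plain text can either
request /rmeta/text or unescape the xhtml itself.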
I acknowledge that I may have misunderstood your question, though.
Please let me know if this helps.
Best,
Tim
P.S. I've been thinking about adding the /unpack functionality to the
/rmeta endpoint and including the bytes of attachments (base64 encoded)
and/or page renderings. Would this be of any interest? One key limitation of
/unpack is that it doesn't handle attachments recursively, IIRC.
On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:
> Hi,
>
>
>
> I noticed a different behavior concerning the treatment of an XHTML
> document between the /unpack endpoint and the /rmeta endpoint on Tika
> Server v1.27 (in auto detect)
>
>
>
> My input document is an XHTML document containing HTML escaped ‘&’ (so
> ‘&amp;’), and the resulting output of the /unpack endpoint is a text with
> unescaped ‘&’ where the output of the /rmeta endpoint is a text still
> containing the escaped form ‘&amp;’
>
>
>
> I am wondering if it is a normal behavior or not ?
>
>
>
> It can be easily tested with a simple test.html file containing :
>
>
>
> <html>
>
> <body>
>
> Parse &amp; extract
>
> </body>
>
> </html>
>
>
>
> Regards,
>
>
>
> Julien Massiera
>
> Product Manager
>
> France Labs – Makers of Datafari Enterprise Search
> <https://www.datafari.com/en>
>
>
>