Hi Julien,
  Sorry for the delay.  The results below are from Tika Server 1.27.

When I run /rmeta, I get xhtml that correctly escapes the & (I think?),
because the default /rmeta content format is xhtml (the \u0026 below is
just the JSON escape for &):
<html xmlns:"...><body>Parse \u0026amp; extract</body></html>

When I run /rmeta/text, I get the text with the &amp; correctly converted:
\n\n\n\n\n\n\n\n\n\n                Parse \u0026 extract\n\n

When I run /unpack/all, I get the text with the &amp; correctly converted:
Parse & extract
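
In case it helps to reproduce the three calls above, here is a rough sketch.
It assumes a Tika Server listening on http://localhost:9998 and a local
test.html like the one in your message; BASE_URL, build_request and fetch
are names I made up for illustration, so adjust as needed.

```python
import urllib.request

# Assumption: Tika Server is running locally on port 9998.
BASE_URL = "http://localhost:9998"

# The three endpoints compared in this thread.
ENDPOINTS = ["/rmeta", "/rmeta/text", "/unpack/all"]


def build_request(endpoint, payload):
    """Build a PUT request that sends the raw document bytes to an endpoint."""
    return urllib.request.Request(BASE_URL + endpoint, data=payload, method="PUT")


def fetch(endpoint, payload):
    """Send the document and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(endpoint, payload)) as resp:
        return resp.read()
```

With the server up, something like `fetch("/rmeta/text", open("test.html", "rb").read())`
should return the raw response bytes for that endpoint.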

So, I think the difference you are seeing is that the default handler in
/rmeta is the xhtml content handler, whereas it is the ToTextHandler in
/unpack.
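
To make the two layers of escaping concrete, here is a small sketch in
plain Python (no Tika involved). It assumes the content sits under an
"X-TIKA:content" key in the /rmeta JSON and uses a shortened value; the
tag stripping is only a crude stand-in for a text handler, not what Tika
actually does internally.

```python
import html
import json
import re

# Shortened /rmeta-style JSON; the "X-TIKA:content" key is an assumption here.
raw_json = r'{"X-TIKA:content": "<body>Parse \u0026amp; extract</body>"}'

# Step 1: JSON decoding turns the \u0026 escape back into a literal '&',
# which leaves the XHTML entity '&amp;' in the content.
xhtml = json.loads(raw_json)["X-TIKA:content"]

# Step 2: a text-style view drops the markup and resolves the entity.
# (Crude illustration only: strip tags, then unescape entities.)
text = html.unescape(re.sub(r"<[^>]+>", "", xhtml))

print(xhtml)  # XHTML view, as with the default /rmeta handler
print(text)   # text view, as with /rmeta/text or /unpack
```

So the "&amp;" in the default /rmeta output is the XHTML form of the same
character that shows up as a plain "&" in the text output.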

  I acknowledge that I may have misunderstood your question, though.
Please let me know if this helps.


         Best,

                Tim

P.S.  I've been thinking about adding the /unpack functionality to the
/rmeta endpoint and including the bytes of attachments (base64 encoded)
and/or page renderings.  Would this be of any interest?  One key limitation
of /unpack is that it doesn't handle attachments recursively, IIRC.


On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:

> Hi,
>
>
>
> I noticed a different behavior concerning the treatment of an XHTML
> document between the /unpack endpoint and the /rmeta endpoint on Tika
> Server v1.27 (in auto detect)
>
>
>
> My input document is an XHTML document containing an HTML-escaped ‘&’ (so
> ‘&amp;’). The output of the /unpack endpoint is text with an unescaped
> ‘&’, whereas the output of the /rmeta endpoint is text that still contains
> the escaped form ‘&amp;’.
>
>
>
> I am wondering whether this is normal behavior.
>
>
>
> It can be easily tested with a simple test.html file containing :
>
>
>
> <html>
>
>         <body>
>
>                 Parse &amp; extract
>
>         </body>
>
> </html>
>
>
>
> Regards,
>
>
>
> Julien Massiera
>
> Product Manager
>
> France Labs – Makers of Datafari Enterprise Search
> <https://www.datafari.com/en>
> Datafari Enterprise Search - Join us at Open Source eXPerience 2021
> <https://www.opensource-experience.com/> on November 9 and 10
>
>
>
