Hi Tim,

Thanks for your answer! Yes, I wrongly assumed that the default handler for the 
/rmeta endpoint would be the ToTextHandler, but since it has two other endpoints, 
/rmeta/text and /rmeta/body, it makes sense!

Best regards,
Julien   

-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Monday, November 22, 2021 23:13
To: <[email protected]> <[email protected]>
Subject: Re: Behavior unpack vs rmeta endpoints

Hi Julien,
  I'm sorry for my delay.  The following are with 1.27.

When I run /rmeta, I get xhtml that correctly escapes the &amp; (I think?), 
because the default /rmeta content format is xhtml:
<html xmlns:"...><body>Parse \u0026amp; extract</body></html>

When I run /rmeta/text, I get the text with the &amp; correctly converted:
\n\n\n\n\n\n\n\n\n\n                Parse \u0026 extract\n\n

When I run /unpack/all, I get the text with the &amp; correctly converted:
Parse & extract
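
To make the three outputs above easier to compare, here is a minimal Python 
sketch. It assumes that the \u0026 in the /rmeta outputs is simply the JSON 
escape for "&" (the /rmeta responses are JSON), and the variable names are 
made up for illustration:

```python
import html
import json

# Fragments copied from the outputs above, wrapped in quotes so that they
# are valid JSON strings. Assumption: \u0026 is the JSON escape for "&".
rmeta_xhtml_fragment = r'"Parse \u0026amp; extract"'  # from /rmeta
rmeta_text_fragment = r'"Parse \u0026 extract"'       # from /rmeta/text

# Decoding the JSON layer leaves the XHTML entity in place for /rmeta ...
xhtml_value = json.loads(rmeta_xhtml_fragment)
# ... and a plain ampersand for /rmeta/text.
text_value = json.loads(rmeta_text_fragment)

# Unescaping the XHTML entity yields the same text as /rmeta/text and /unpack/all.
unescaped_value = html.unescape(xhtml_value)
```

Under that assumption, the /rmeta value still carries one extra layer of 
escaping (the XHTML entity), which is what the text-based outputs remove.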

So, I think the difference you are seeing is that the default handler in /rmeta 
is the xhtml content handler, whereas it is the ToTextHandler in /unpack.
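
For anyone who wants to reproduce the comparison, the following is a rough 
Python sketch that prepares the same PUT request for each endpoint. The 
address http://localhost:9998 and the helper names build_request and fetch 
are assumptions for illustration only, and the explicit text/html content 
type is a hint that differs from the auto-detect setup in the original report:

```python
import urllib.request

# Assumption: tika-server is listening on its usual local address.
TIKA_URL = "http://localhost:9998"

# The three endpoints compared in this thread.
ENDPOINTS = ("/rmeta", "/rmeta/text", "/unpack/all")

# The test document from the original report, on a single line.
SAMPLE = b"<html><body>Parse &amp; extract</body></html>"


def build_request(endpoint, payload=SAMPLE, base_url=TIKA_URL):
    """Prepare (but do not send) a PUT request carrying the document bytes."""
    return urllib.request.Request(
        base_url + endpoint,
        data=payload,
        method="PUT",
        # urllib would otherwise add a form-encoded content type for a
        # request with a body, so one is set explicitly here.
        headers={"Content-Type": "text/html"},
    )


def fetch(endpoint, payload=SAMPLE):
    """Send the request and return the raw response bytes."""
    with urllib.request.urlopen(build_request(endpoint, payload)) as response:
        return response.read()
```

Calling fetch(endpoint) for each entry in ENDPOINTS against a running server 
should return JSON for the two /rmeta variants and an archive for 
/unpack/all, so the latter needs to be unpacked before its text can be 
compared.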

  I acknowledge that I may have misunderstood your question, though.
Please let me know if this helps.


         Best,

                Tim

P.S.  I've been thinking about adding the /unpack functionality to the /rmeta 
endpoint and including the bytes of attachments (base64-encoded and/or page 
renderings).  Would this be of any interest?  One key limitation of /unpack is 
that it doesn't handle attachments recursively, IIRC.


On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:

> Hi,
>
>
>
> I noticed that the /unpack endpoint and the /rmeta endpoint of Tika 
> Server v1.27 (in auto-detect mode) treat an XHTML document 
> differently.
>
>
>
> My input document is an XHTML document containing an HTML-escaped ‘&’ (so 
> ‘&amp;’). The output of the /unpack endpoint is text with an 
> unescaped ‘&’, whereas the output of the /rmeta endpoint is text 
> that still contains the escaped form ‘&amp;’.
>
>
>
> I am wondering whether this is normal behavior or not.
>
>
>
> It can easily be tested with a simple test.html file containing:
>
>
>
> <html>
>
>         <body>
>
>                 Parse &amp; extract
>
>         </body>
>
> </html>
>
>
>
> Regards,
>
>
>
> Julien Massiera
>
> Product Manager
>
> France Labs – Makers of Datafari Enterprise Search 
> <https://www.datafari.com/en> Datafari Enterprise Search - 
> Meet us at Open Source eXPerience 2021 
> <https://www.opensource-experience.com/> on November 9 and 10
>
>
>
