Hi Tim,

Thanks for your answer! Yes, I wrongly assumed that the default handler for the /rmeta endpoint would be the ToTextHandler, but since it has two other endpoints, /rmeta/text and /rmeta/body, it makes sense!

Best regards,
Julien

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Monday, November 22, 2021 23:13
To: <[email protected]> <[email protected]>
Subject: Re: Behavior unpack vs rmeta endpoints

Hi Julien,

I'm sorry for my delay. The following are with 1.27.

When I run /rmeta, I get xhtml that correctly escapes the & (I think?), because the default /rmeta content format is xhtml:

<html xmlns:"...><body>Parse \u0026amp; extract</body></html>

When I run /rmeta/text, I get the text with the & correctly converted:

\n\n\n\n\n\n\n\n\n\n Parse \u0026 extract\n\n

When I run /unpack/all, I get the text with the & correctly converted:

Parse & extract

So, I think the difference you are seeing is that the default handler in /rmeta is the xhtml content handler, whereas it is the ToTextHandler in /unpack.

I acknowledge that I may have misunderstood your question, though. Please let me know if this helps.

Best,
Tim

P.S. I've been thinking about adding the /unpack functionality to the /rmeta endpoint and including the bytes of attachments (base64 encoded and/or page renderings). Would this be of any interest? One key limitation of /unpack is that it doesn't handle attachments recursively, IIRC.

On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:
> Hi,
>
> I noticed a different behavior concerning the treatment of an XHTML
> document between the /unpack endpoint and the /rmeta endpoint on Tika
> Server v1.27 (in auto detect).
>
> My input document is an XHTML document containing an HTML-escaped ‘&’ (so
> ‘&amp;’). The resulting output of the /unpack endpoint is a text
> with an unescaped ‘&’, whereas the output of the /rmeta endpoint is a text
> still containing the escaped form ‘&amp;’.
>
> I am wondering whether this is normal behavior or not?
>
> It can be easily tested with a simple test.html file containing:
>
> <html>
> <body>
> Parse &amp; extract
> </body>
> </html>
>
> Regards,
>
> Julien Massiera
> Product Manager
> France Labs – Makers of Datafari Enterprise Search
> <https://www.datafari.com/en>
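
For anyone comparing the outputs quoted in this thread, here is a minimal Python sketch of how the two /rmeta values relate. It is an illustration only, not part of Tika. The `\u0026` in the raw responses is just the JSON escape for `&`, so the xhtml content really is `&amp;` and the text content really is `&`. The helper names `xhtml_to_text` and `put_file`, and the local server address on port 9998, are assumptions made for this example.

```python
import html
import json
import urllib.request

# Assumption: a Tika server is running locally on port 9998.
TIKA_URL = "http://localhost:9998"


def xhtml_to_text(fragment: str) -> str:
    """Resolve HTML entities such as &amp; in an XHTML text fragment."""
    return html.unescape(fragment)


def put_file(path: str, endpoint: str) -> bytes:
    """PUT a local file to an endpoint such as /rmeta or /rmeta/text.

    Hypothetical helper for reproducing the comparison; it is not called
    below, so the rest of the sketch runs without a server.
    """
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            TIKA_URL + endpoint, data=fh.read(), method="PUT"
        )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Content values as they appear in the raw JSON quoted in this thread.
# json.loads turns the \u0026 escape back into a literal '&'.
rmeta_xhtml = json.loads('"Parse \\u0026amp; extract"')
rmeta_text = json.loads('"Parse \\u0026 extract"')

print(rmeta_xhtml)                 # still entity-escaped, as XHTML requires
print(rmeta_text)                  # plain text
print(xhtml_to_text(rmeta_xhtml))  # entity resolved on the client side
```

With a server running, calling `put_file("test.html", "/rmeta")` and `put_file("test.html", "/rmeta/text")` should return the JSON bodies to compare, assuming test.html is the file described above.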
