No, it is OK, Tim. Everything should remain as it is.

Thanks!

Julien

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Tuesday, November 23, 2021 14:41
To: <[email protected]> <[email protected]>
Subject: Re: Behavior unpack vs rmeta endpoints

I went with a default of xhtml on /rmeta to mirror the behavior of /tika, which also defaults to xhtml.

Would you want /unpack functionality in /rmeta? The output would be the same as /rmeta's current output, but there would be an X-TIKA:raw_bytes field (or similar) that included base64-encoded bytes from attachments.

Cheers,

Tim

On Tue, Nov 23, 2021 at 4:07 AM <[email protected]> wrote:
>
> Hi Tim,
>
> Thanks for your answer! Yes, I wrongly thought that the default handler for
> the /rmeta endpoint would be ToTextHandler, but since it has two other
> endpoints, /rmeta/text and /rmeta/body, it makes sense!
>
> Best regards,
> Julien
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Monday, November 22, 2021 23:13
> To: <[email protected]> <[email protected]>
> Subject: Re: Behavior unpack vs rmeta endpoints
>
> Hi Julien,
> I'm sorry for my delay. The following are with 1.27.
>
> When I run /rmeta, I get xhtml that correctly escapes the & (I think?),
> because the default /rmeta content format is xhtml:
> <html xmlns:"...><body>Parse \u0026amp; extract</body></html>
>
> When I run /rmeta/text, I get the text with the & correctly converted:
> \n\n\n\n\n\n\n\n\n\n Parse \u0026 extract\n\n
>
> When I run /unpack/all, I get the text with the & correctly converted:
> Parse & extract
>
> So, I think the difference you are seeing is that the default handler in
> /rmeta is the xhtml content handler, whereas it is the ToTextHandler in
> /unpack.
>
> I acknowledge that I may have misunderstood your question, though.
> Please let me know if this helps.
>
> Best,
>
> Tim
>
> P.S. I've been thinking about adding the /unpack functionality to the /rmeta
> endpoint and including the bytes of attachments (base64 encoded and/or page
> renderings). Would this be of any interest? One key limitation of /unpack
> is that it doesn't handle attachments recursively IIRC.
>
> On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:
> >
> > Hi,
> >
> > I noticed a difference in how an XHTML document is treated by the
> > /unpack endpoint and the /rmeta endpoint on Tika Server v1.27
> > (in auto detect).
> >
> > My input document is an XHTML document containing an HTML-escaped '&'
> > (so '&amp;'). The output of the /unpack endpoint is text with an
> > unescaped '&', whereas the output of the /rmeta endpoint is text that
> > still contains the escaped form '&amp;'.
> >
> > I am wondering whether this is normal behavior.
> >
> > It can be easily tested with a simple test.html file containing:
> >
> > <html>
> > <body>
> > Parse &amp; extract
> > </body>
> > </html>
> >
> > Regards,
> >
> > Julien Massiera
> > Product Manager
> > France Labs – Makers of Datafari Enterprise Search
> > <https://www.datafari.com/en> Datafari Enterprise Search
> > Meet us at Open Source eXPerience 2021
> > <https://www.opensource-experience.com/> on November 9 and 10
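
For readers comparing the outputs quoted in this thread: the default /rmeta value carries two layers of escaping, a JSON `\u0026` escape for the ampersand character and the XHTML entity `&amp;` underneath it. The following is only a minimal Python sketch of how a client might peel off both layers to arrive at the text that /unpack/all returned. It assumes the JSON string value is exactly the fragment quoted in Tim's message; the helper name `decode_rmeta_xhtml_text` is hypothetical and is not part of Tika.

```python
import html
import json


def decode_rmeta_xhtml_text(json_string_literal: str) -> str:
    """Hypothetical helper: undo JSON escaping, then XML/HTML entity escaping."""
    # Layer 1: JSON decoding turns the \u0026 escape into a literal '&',
    # which leaves the XHTML-escaped form '&amp;'.
    xhtml_escaped = json.loads(json_string_literal)
    # Layer 2: resolving the entity yields the plain character.
    return html.unescape(xhtml_escaped)


# Text fragment as quoted from the default /rmeta (xhtml) output in the thread.
rmeta_fragment = '"Parse \\u0026amp; extract"'
# Text fragment as quoted from the /rmeta/text output (JSON escape only).
rmeta_text_fragment = '"Parse \\u0026 extract"'

after_json_only = json.loads(rmeta_fragment)
fully_decoded = decode_rmeta_xhtml_text(rmeta_fragment)
text_variant = json.loads(rmeta_text_fragment)

print(after_json_only)
print(fully_decoded)
print(text_variant)
```

In practice, requesting /rmeta/text avoids the second layer entirely, since only the JSON escape remains to be decoded there.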
