I went with a default of xhtml on /rmeta to mirror the behavior of
/tika, which also defaults to xhtml.
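
As a rough illustration of what that default means for the ampersand case
in this thread, here is a small Python sketch. It assumes the content
strings look like the ones quoted below, trimmed to a bare <body> element
for brevity. The function names are made up for the example and are not
part of Tika.

```python
import json
import xml.etree.ElementTree as ET


def decode_json_string(raw):
    """Decode a JSON string literal, e.g. one containing \\u0026 for '&'."""
    return json.loads(raw)


def xhtml_body_text(xhtml):
    """Parse a small XHTML fragment and return its text with entities resolved."""
    return ET.fromstring(xhtml).text


# Trimmed-down stand-ins for the /rmeta (xhtml) and /rmeta/text content values.
xhtml_raw = '"<body>Parse \\u0026amp; extract</body>"'
text_raw = '"Parse \\u0026 extract"'

xhtml_content = decode_json_string(xhtml_raw)  # still contains &amp;
text_content = decode_json_string(text_raw)    # contains a plain &

# Once the XHTML is parsed, both forms carry the same text.
print(xhtml_body_text(xhtml_content) == text_content)
```

So the xhtml output is not wrong; it just needs an XML-aware consumer to
turn the entity back into a plain ampersand.
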
Would you want /unpack functionality in /rmeta? The output would be the
same as /rmeta today, but there would be an X-TIKA:raw_bytes field (or
similar) that included the base64-encoded bytes from attachments.
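
To make the idea concrete, a client could consume such a field roughly
like this. This is only a sketch: the X-TIKA:raw_bytes key is the
hypothetical name from the proposal above and does not exist in Tika
today, the helper name is invented, and the sample response is made up.
The only thing taken from current behavior is that /rmeta returns a JSON
list of metadata objects.

```python
import base64
import json

# Hypothetical key name taken from the proposal above; it does not exist in Tika today.
RAW_BYTES_KEY = "X-TIKA:raw_bytes"


def attachment_bytes(rmeta_json):
    """Map the index of each metadata object to its decoded raw bytes, if present."""
    decoded = {}
    for index, metadata in enumerate(json.loads(rmeta_json)):
        if RAW_BYTES_KEY in metadata:
            decoded[index] = base64.b64decode(metadata[RAW_BYTES_KEY])
    return decoded


# Made-up response: a container document followed by one attachment.
sample = json.dumps([
    {"X-TIKA:content": "container text"},
    {"X-TIKA:content": "attachment text",
     RAW_BYTES_KEY: base64.b64encode(b"attachment bytes").decode("ascii")},
])

print(attachment_bytes(sample))
```
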
Cheers,
Tim
On Tue, Nov 23, 2021 at 4:07 AM <[email protected]> wrote:
>
> Hi Tim,
>
> Thanks for your answer! Yes, I wrongly thought that the default handler for
> the /rmeta endpoint would be the ToTextHandler, but since it has two other
> endpoints, /rmeta/text and /rmeta/body, it makes sense!
>
> Best regards,
> Julien
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Monday, November 22, 2021 23:13
> To: <[email protected]> <[email protected]>
> Subject: Re: Behavior unpack vs rmeta endpoints
>
> Hi Julien,
> I'm sorry for my delay. The following results are with 1.27.
>
> When I run /rmeta, I get xhtml that correctly escapes the & (I think?),
> because the default /rmeta content format is xhtml:
> <html xmlns="..."><body>Parse \u0026amp; extract</body></html>
>
> When I run /rmeta/text, I get the text with the & correctly converted:
> \n\n\n\n\n\n\n\n\n\n Parse \u0026 extract\n\n
>
> When I run /unpack/all, I get the text with the & correctly converted:
> Parse & extract
>
> So, I think the difference you are seeing is that the default handler in
> /rmeta is the xhtml content handler, whereas it is the ToTextHandler in
> /unpack.
>
> I acknowledge that I may have misunderstood your question, though.
> Please let me know if this helps.
>
>
> Best,
>
> Tim
>
> P.S. I've been thinking about adding the /unpack functionality to the /rmeta
> endpoint and including the bytes of attachments (base64 encoded and/or page
> renderings). Would this be of any interest? One key limitation of /unpack
> is that it doesn't handle attachments recursively IIRC.
>
>
> On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:
>
> > Hi,
> >
> >
> >
> > I noticed different behavior in the treatment of an XHTML
> > document between the /unpack endpoint and the /rmeta endpoint on Tika
> > Server v1.27 (in auto-detect mode).
> >
> >
> >
> > My input document is an XHTML document containing an HTML-escaped ‘&’ (so
> > ‘&amp;’). The output of the /unpack endpoint is text with an unescaped
> > ‘&’, whereas the output of the /rmeta endpoint is text that still
> > contains the escaped form ‘&amp;’.
> >
> >
> >
> > I am wondering whether this is normal behavior.
> >
> >
> >
> > It can be easily tested with a simple test.html file containing:
> >
> >
> >
> > <html>
> >
> > <body>
> >
> > Parse &amp; extract
> >
> > </body>
> >
> > </html>
> >
> >
> >
> > Regards,
> >
> >
> >
> > Julien Massiera
> >
> > Product Manager
> >
> > France Labs – Makers of Datafari Enterprise Search
> > <https://www.datafari.com/en> Datafari Enterprise Search -
> > Join us at Open Source eXPerience 2021
> > <https://www.opensource-experience.com/> on November 9 and 10
> >
> >
> >
>