Re: Behavior unpack vs rmeta endpoints

Tim Allison Tue, 23 Nov 2021 05:41:14 -0800

I went with a default of xhtml on /rmeta to mirror the behavior of
/tika, which is default xhtml.
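
For anyone following along, the effect of the xhtml default can be reproduced without a running server. This is only a minimal sketch in Python: the two JSON strings are trimmed from the /rmeta and /rmeta/text outputs quoted further down, and the variable names are made up for illustration. It shows that the xhtml value, once XML-unescaped, is the same text that the text handler returns.

```python
import html
import json

# JSON string values trimmed from the quoted /rmeta and /rmeta/text outputs
# (surrounding markup and whitespace omitted for brevity).
rmeta_xhtml_json = '"Parse \\u0026amp; extract"'
rmeta_text_json = '"Parse \\u0026 extract"'

# JSON decoding turns the \u0026 escape into a literal '&'.
xhtml_value = json.loads(rmeta_xhtml_json)  # still XML-escaped: contains '&amp;'
text_value = json.loads(rmeta_text_json)    # plain text: contains '&'

print(xhtml_value)
print(text_value)

# Unescaping the xhtml value yields the same text as the text handler output.
print(html.unescape(xhtml_value) == text_value)
```

So the escaped form in /rmeta is what one would expect from well-formed xhtml, not a parsing difference.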


Would you want /unpack functionality in /rmeta?  The output would be the
same as /rmeta's today, but with an added X-TIKA:raw_bytes field (or
similar) containing the base64-encoded bytes of attachments.
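
To make the idea concrete, here is a rough Python sketch of how a client might read such a field. Everything here is an assumption: X-TIKA:raw_bytes is only the name floated above and does not exist in Tika today, the sample payload is hand-built to resemble /rmeta's JSON list of metadata objects, and attachment_bytes is a hypothetical helper.

```python
import base64
import json

# Hand-built sample shaped like /rmeta output: a JSON list with one object per
# container document or attachment. "X-TIKA:raw_bytes" is only the field name
# proposed in this thread; it is an assumption, not an existing Tika key.
sample = json.dumps([
    {"X-TIKA:content": "container text"},
    {
        "X-TIKA:content": "attachment text",
        "X-TIKA:raw_bytes": base64.b64encode(b"Parse & extract").decode("ascii"),
    },
])


def attachment_bytes(rmeta_json, field="X-TIKA:raw_bytes"):
    """Return the decoded bytes of every entry that carries the given field."""
    return [
        base64.b64decode(entry[field])
        for entry in json.loads(rmeta_json)
        if field in entry
    ]


print(attachment_bytes(sample))
```

Entries without the field (such as the container document here) are simply skipped.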

Cheers,

        Tim

On Tue, Nov 23, 2021 at 4:07 AM <[email protected]> wrote:
>
> Hi Tim,
>
> Thanks for your answer! Yes, I wrongly thought that the default handler for
> the /rmeta endpoint would be ToTextHandler, but since it has two other
> endpoints, /rmeta/text and /rmeta/body, it makes sense!
>
> Best regards,
> Julien
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Monday, November 22, 2021 23:13
> To: <[email protected]> <[email protected]>
> Subject: Re: Behavior unpack vs rmeta endpoints
>
> Hi Julien,
>   I'm sorry for my delay.  The following are with 1.27.
>
> When I run /rmeta, I get xhtml that correctly escapes the &amp; (I think?), 
> because the default /rmeta content format is xhtml:
> <html xmlns:"...><body>Parse \u0026amp; extract</body></html>
>
> When I run /rmeta/text, I get the text with the &amp; correctly converted:
> \n\n\n\n\n\n\n\n\n\n                Parse \u0026 extract\n\n
>
> When I run /unpack/all, I get the text with the &amp; correctly converted:
> Parse & extract
>
> So, I think the difference you are seeing is that the default handler in 
> /rmeta is the xhtml content handler, whereas it is the ToTextHandler in 
> /unpack.
>
>   I acknowledge that I may have misunderstood your question, though.
> Please let me know if this helps.
>
>
>          Best,
>
>                 Tim
>
> P.S.  I've been thinking about adding the /unpack functionality to the /rmeta
> endpoint and including the bytes of attachments (base64 encoded and/or page
> renderings).  Would this be of any interest?  One key limitation of /unpack
> is that it doesn't handle attachments recursively, IIRC.
>
>
> On Fri, Nov 19, 2021 at 5:32 AM <[email protected]> wrote:
>
> > Hi,
> >
> >
> >
> > I noticed a different behavior concerning the treatment of an XHTML
> > document between the /unpack endpoint and the /rmeta endpoint on Tika
> > Server v1.27 (in auto detect)
> >
> >
> >
> > My input document is an XHTML document containing an HTML-escaped ‘&’ (so
> > ‘&amp;’). The output of the /unpack endpoint is text with an unescaped
> > ‘&’, whereas the output of the /rmeta endpoint is text that still
> > contains the escaped form ‘&amp;’.
> >
> >
> >
> > I am wondering whether this is normal behavior.
> >
> >
> >
> > It can easily be tested with a simple test.html file containing:
> >
> >
> >
> > <html>
> >
> >         <body>
> >
> >                 Parse &amp; extract
> >
> >         </body>
> >
> > </html>
> >
> >
> >
> > Regards,
> >
> >
> >
> > Julien Massiera
> >
> > Product Manager
> >
> > France Labs – Makers of Datafari Enterprise Search
> > <https://www.datafari.com/en> Datafari Enterprise Search -
> > Meet us at Open Source eXPerience 2021
> > <https://www.opensource-experience.com/> on November 9 and 10
> >
> >
> >
>
