Thank you, Nick. For the reasons you listed, I'm averse to adding fileUrl
back, but I'm not entirely against it.
Would it be as much of a disaster to require the user to allow the fileUrl
capability on the command line at server startup? We could add some menacing
"all bets are off, we hope you know what you're doing" warning.
If we went with something like this, we could allow all URLs, and users
wouldn't have to ship the bytes over the network; tika-server could read local
files from the file share.
This might still be a remarkably bad idea...
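To make the opt-in idea concrete, here is a rough sketch (in Python for brevity, and not tika-server's actual code) of gating fileUrl behind a startup flag and printing a warning when it is turned on. The flag name `--enable-file-url`, the function names, and the warning text are all made up for illustration.

```python
import argparse
import sys

# Hypothetical warning text, printed once at startup when the feature is on.
WARNING = ("WARNING: fileUrl is enabled. All bets are off; "
           "we hope you know what you're doing.")


def parse_args(argv):
    """Parse the (hypothetical) server command line."""
    parser = argparse.ArgumentParser(prog="server-sketch")
    parser.add_argument("--enable-file-url", action="store_true",
                        help="allow requests to name a URL for the server to read")
    return parser.parse_args(argv)


def start(argv):
    """Return True if fileUrl was explicitly enabled; warn on stderr if so."""
    args = parse_args(argv)
    if args.enable_file_url:
        print(WARNING, file=sys.stderr)
    return args.enable_file_url


def check_file_url(url, file_url_enabled):
    """Pass the URL through only when the operator opted in at startup."""
    if not file_url_enabled:
        raise PermissionError(
            "fileUrl is disabled; restart the server with --enable-file-url")
    return url
```

With the flag absent, every fileUrl request is refused; with it present, any URL (including file://) is passed through unchanged, which is exactly why the warning is there.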
Cheers,
Tim
P.S.
> My main concern in accessing the Tika libraries via TikaJAXRS is the
> performance overheads associated with going through sockets (and possibly
> the additional memory/file copying of file data if fileUrl is not available).
In my experience, depending on the file types, there's definitely some
overhead, but the bottleneck is in the parsers (esp. for complex document
formats -- msoffice, pdf, etc.), not data sloshing.
-----Original Message-----
From: John Dougrez-Lewis [mailto:[email protected]]
Sent: Wednesday, September 14, 2016 2:35 AM
To: [email protected]
Cc: 'Nick Burch' <[email protected]>
Subject: RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract
document at remote url - my request is not working
Thanks for the insight.
My interest (as a developer) in TikaJAXRS is that it provides a nice
encapsulation of Tika functionality which is accessible across language
boundaries. The fact that it can then also cross network boundaries is of
secondary importance to me.
I'm developing code in C++ and I'd like to be able to access Tika's
capabilities.
TikaJAXRS offers an easy way in. If the fileUrl functionality were in place,
and TikaJAXRS were running on the same box as the client, restricted to
listening on 127.0.0.1 and with the file:// check as well, this would limit
some of the dangers listed below - an attacker would then need access to your
host box itself, in which case you would have already lost.
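The localhost restriction could be enforced with a startup check along these lines. This is only a sketch in Python, not anything TikaJAXRS provides; the function names are hypothetical, and it covers just the 127.0.0.1 part, not the file:// check.

```python
import ipaddress


def is_loopback_bind(host):
    """True only for literal loopback addresses such as 127.0.0.1 or ::1.

    Hostnames are rejected rather than resolved, so that a surprising DNS
    or hosts-file entry cannot widen the listening address.
    """
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. "localhost"); treat as not verified.
        return False


def require_loopback_for_file_url(host, file_url_enabled):
    """Refuse to start with fileUrl enabled on a non-loopback address."""
    if file_url_enabled and not is_loopback_bind(host):
        raise ValueError(
            "fileUrl may only be enabled when listening on a loopback "
            "address, got %r" % host)
```

Calling `require_loopback_for_file_url("0.0.0.0", True)` raises, while the same call with `"127.0.0.1"` returns quietly.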
My main concern in accessing the Tika libraries via TikaJAXRS is the
performance overheads associated with going through sockets (and possibly the
additional memory/file copying of file data if fileUrl is not available).
Short of the Herculean task of porting the entirety of Tika from Java to
C++, are there any better, well-established, more performant ways of
interfacing from C++ to the Java Tika code?
Regards,
John