Thanks for the insight. My interest (as a developer) in TikaJAXRS is that it provides a nice encapsulation of Tika functionality which is accessible across language boundaries. The fact that it can then also cross network boundaries is of secondary importance to me.
I'm developing code in C++ and I'd like to be able to access Tika's capabilities. TikaJAXRS offers an easy way in.

If the fileUrl functionality were in place, with TikaJAXRS running on the same box as the client, restricted to listening on 127.0.0.1, and with the file:// check as well, this would limit some of the dangers listed below - an attacker would then need access to the host box itself, in which case you would have already lost.

My main concern in accessing the Tika libraries via TikaJAXRS is the performance overhead of going through sockets (and possibly the additional memory/file copying of file data if fileUrl is not available).

Short of the Herculean task of porting the entirety of Tika from Java to C++, are there any better, well-established, more performant ways of interfacing from C++ to the Java Tika code?

Regards,

John

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: 13 September 2016 15:34
To: John Dougrez-Lewis
Cc: [email protected]
Subject: RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

On Tue, 13 Sep 2016, John Dougrez-Lewis wrote:
> Surely the security vulnerability could have been fixed by disallowing
> "file://" variants in the URL rather than removing the feature altogether?
>
> Or were there other implementation issues relating to the fileUrl
> feature that meant it was best removed?

As the fetch is done by the server, it could allow you to fetch documents that you as a user couldn't see/access/reach but the server could. It also has some denial of service risks, plus it doesn't have things you want from a web spider like pools / limits / robots.txt acceptance etc.

Nick
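
For illustration only, the "file://" check discussed in this thread could be sketched on the C++ client side roughly as below. This is a minimal sketch and not anything Tika provides: the function name is_allowed_fetch_url is hypothetical, and it assumes an allow-list of http and https rather than a block-list of file://, so that other unexpected schemes are rejected too. It addresses only the scheme concern, not the server-side reachability or denial of service points Nick raises.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Tika): returns true only when the URL
// has an explicit "http" or "https" scheme, compared case-insensitively.
// Anything else, including file:// and scheme-less strings, is rejected.
bool is_allowed_fetch_url(const std::string& url) {
    const std::string::size_type sep = url.find("://");
    if (sep == std::string::npos || sep == 0) {
        return false;  // no scheme separator, or an empty scheme
    }
    std::string scheme = url.substr(0, sep);
    // Lower-case the scheme so "FILE://" and "Https://" are handled uniformly.
    std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return scheme == "http" || scheme == "https";
}
```

A caller would run this check before passing a URL on to the server and refuse the request when it returns false.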
