Re: Dependencies needed to just use TikaFileTypeDetector

Tim Allison Tue, 20 Nov 2018 12:57:01 -0800

Sorry for the late reply!

>Would it be okay to exclude tika-parsers which makes up the biggest
>module without losing file type detection abilities?


It _should work_ for straight mime/magic detection.  However, you need
some of the classes in tika-parsers to identify specific POIFS (OLE2),
OOXML, and package types, i.e. to work out what kind of OLE2 or OOXML
file you are really looking at.  These are the key classes that you'd
be missing without tika-parsers:

org.apache.tika.parser.microsoft.POIFSContainerDetector
org.apache.tika.parser.pkg.ZipContainerDetector

You'd also need to include POI and commons-compress as dependencies.
In short: experiment with those file types and see if the results are
ok for you.
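
To illustrate why magic bytes alone stop short for container formats,
here is a minimal sketch using only the JDK. It is not Tika's
implementation; MagicSketch, its method names, and the returned labels
are hypothetical and chosen for illustration only. An OOXML file
(.docx, .xlsx) begins with the same ZIP signature as any other zip, and
a legacy .doc or .xls begins with the same OLE2 signature as any other
compound file, so a header check can only report the container.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of plain magic-byte detection (not Tika code).
class MagicSketch {
    // Local file header signature shared by zip, jar, docx, xlsx, odt, ...
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // OLE2 compound file signature shared by doc, xls, ppt, msg, ...
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies by header only; the labels are illustrative.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip"; // could be docx, xlsx, jar, ...
        }
        if (startsWith(head, OLE2)) {
            return "ole2-container"; // could be doc, xls, ppt, ...
        }
        return "application/octet-stream";
    }
}
```

Telling a .docx from a plain zip means opening the container and
looking at its entries, which is the job of the two detector classes
listed above.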

>But that's some kind of white box knowledge to rely on which is not that
>good. Can you somehow provide different implementations of
>java.nio.file.spi.FileTypeDetector, say "eager" and "lazy"?

I wouldn't object to this.  To confirm, we'd have two implementations
in the java7 package and you'd do something like this in your code:

    Class.forName("org.apache.tika.filetypedetector.TikaBytesFileTypeDetector");
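
A minimal sketch of that kind of check, assuming the caller only wants
to know whether an optional detector class is on the classpath without
taking a compile-time dependency on it. OptionalDetectorCheck and
isPresent are hypothetical names, and TikaBytesFileTypeDetector is only
the class name floated above; it may not exist in any Tika release.

```java
// Hypothetical helper: probes for an optional class by name.
class OptionalDetectorCheck {
    // Returns true if the named class can be loaded; does not initialize it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false,
                    OptionalDetectorCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is absent, or one of its own dependencies is missing.
            return false;
        }
    }
}
```

A caller could then evaluate
isPresent("org.apache.tika.filetypedetector.TikaBytesFileTypeDetector")
and fall back to its default behaviour when the result is false.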

Open a ticket on our JIRA, and I'll see what I can do.

Fellow devs, any other recommendations?

Best,

              Tim

On Sun, Nov 18, 2018 at 5:44 AM cnsgithub <[email protected]> wrote:
>
> Hey guys,
>
> we recently contributed some security improvements to the popular JSF
> component library PrimeFaces to validate specified accepted content
> types of uploaded files at server side. Therefore we make use of Java's
> Files.probeContentType which automatically picks up registered
> java.nio.file.spi.FileTypeDetector service implementations. The default
> implementations however primarily check the file extension only by doing
> registry lookups or something like that. That's insufficient from a
> security point of view. That's why we will recommend to have Apache Tika
> in the classpath, more strictly speaking the tika-java7 dependency.
>
> Now there are two questions regarding the use of TikaFileTypeDetector
> and the required dependencies:
> 1. Transitive dependencies of tika-java7 are really big (more than 50
> megabyte). I know that Apache Tika is not just about file type detection
> but very much more like meta data extraction that we don't need at all.
> Would it be okay to exclude tika-parsers which makes up the biggest
> module without losing file type detection abilities?
> 2. Unfortunately, TikaFileTypeDetector defaults to perform most
> efficiently by having included a short circuit if the content type can
> be guessed from the file extension. This is however insecure since we
> want to protect our users from tampered file uploads. We always want to
> use deep (and expensive) content type analysis by looking at the magic
> bytes or something like that. We currently work around this limitation
> by explicitly putting .tmp as the file name's extension to have Tika
> detect application/octet-stream and force it to go ahead. But that's
> some kind of white box knowledge to rely on which is not that good. Can
> you somehow provide different implementations of
> java.nio.file.spi.FileTypeDetector, say "eager" and "lazy"? Please note
> that we are not allowed to introduce required dependencies, i.e. using
> Tika directly is not an option.
>
> Here are the related issues:
> https://github.com/primefaces/primefaces/issues/2791 and
> https://github.com/primefaces/primefaces/issues/4244
> And the pull requests already merged into PrimeFaces 6.3:
> https://github.com/primefaces/primefaces/pull/4242 and
> https://github.com/primefaces/primefaces/pull/4249
>
> Thanks for your response in advance
>
> Kind regards, cnsgithub
