Mark / Adam,

The size of this thing does seem a bit problematic to me.  Perhaps we
can engage with the Tika team to see if they have recommendations on
how we can reduce it.  We likely don't need its full horsepower.
But 37 MB is a lot, and working through all of the dependencies and
their license accounting is a non-trivial task as well.

Thanks
Joe

On Mon, Feb 23, 2015 at 3:17 PM, Mark Payne <[email protected]> wrote:
> Adam,
>
>
> I definitely like the changes here. I reviewed the code, and I am happy with 
> it.
>
>
> The only thing here that is really giving me pause is the unexpectedly large
> size of the dependency. Pulling in Tika bloats the standard nar from 12 MB
> to a whopping 37 MB. This isn't the end of the world, but I am concerned
> about pulling this in because the deployment is already over 100 MB, and
> there have been earlier discussions about the NiFi build becoming bloated.
>
>
> Are others ok with adding the 25 MB to the build for IdentifyMimeType, or 
> does this give others pause as well?
>
> From: adamonduty
> Sent: Tuesday, February 17, 2015 1:56 PM
> To: [email protected]
>
> GitHub user adamonduty opened a pull request:
>
>     https://github.com/apache/incubator-nifi/pull/27
>
>     NIFI-296: Extend capability of IdentifyMimeType
>
>     ```
>     This commit backs IdentifyMimeType with the Apache Tika library. Tika
>     provides detailed mime type identification such as the ability to
>     differentiate normal zip files from OOXML MS Office documents.
>
>     The mime.type attribute continues to be set, though some mime types
>     have changed due to Tika naming them differently. In addition,
>     the mime.extension attribute is set to provide the commonly used
>     extension for the mime type (if known).
>     ```
>
>     Some additional notes about this commit:
>
>     I removed the IDENTIFY_ZIP and IDENTIFY_TAR properties. Keeping 
> IDENTIFY_ZIP doesn't make sense because Tika is designed to identify 
> container formats like zip files. Excluding zip files from detection would 
> exclude a number of common mime types, which seems like undesirable behavior. 
> IDENTIFY_TAR is in a similar situation.
>
>     Also, in both cases, the previous code would "identify" a zip or tar file
> by attempting to open it with Zip and Tar readers. I believe Tika will use
> magic byte detection as a filtering mechanism to avoid applying deep
> inspection logic (i.e., opening the zip with a reader) when not necessary.
>
>     It takes about 2 seconds to bring up the Tika detectors, which makes the 
> tests run longer, but I believe the detection itself is roughly in the same 
> performance category. The code shares a Tika config and list of detectors to 
> minimize the performance impact related to bringing up detectors.
>
>     I also replaced the test resource `1.tar` with a version created by a
> modern version of tar. The previous tar didn't use the ustar format
> (http://en.wikipedia.org/wiki/Tar_%28computing%29#UStar_format), which was
> standardized in 1988. Tika also couldn't identify the previous tar using
> magic byte detection.
>
>     And finally, a few of the detected mime types changed names due to Tika 
> naming them differently.
>
> You can merge this pull request into a Git repository by running:
>
>     $ git pull https://github.com/adamonduty/incubator-nifi 
> NIFI-296-extend-IdentifyMimeType
>
> Alternatively you can review and apply these changes as the patch at:
>
>     https://github.com/apache/incubator-nifi/pull/27.patch
>
> To close this pull request, make a commit to your master/trunk branch
> with (at least) the following in the commit message:
>
>     This closes #27
>
> ----
> commit 16fb2b826c0cd983b5d905ceed7aff2a84383d33
> Author: Adam Lamar <[email protected]>
> Date:   2015-02-14T20:57:41Z
>
>     NIFI-296: Extend capability of IdentifyMimeType
>
>     This commit backs IdentifyMimeType with the Apache Tika library. Tika
>     provides detailed mime type identification such as the ability to
>     differentiate normal zip files from OOXML MS Office documents.
>
>     The mime.type attribute continues to be set, though some mime types
>     have changed due to Tika naming them differently. In addition,
>     the mime.extension attribute is set to provide the commonly used
>     extension for the mime type (if known).
>
> ----
>
>
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at [email protected] or file a JIRA ticket
> with INFRA.
> ---
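
For reference, the magic byte filtering mentioned in the pull request can be sketched roughly as below. This is not Tika's code: `MagicSniff`, `matchesAt`, and `sniff` are hypothetical names, and the returned type strings are assumed for illustration. The sketch relies only on two well-known signatures: a ZIP local file header starts with the bytes `PK\003\004`, and a ustar archive carries the ASCII string `ustar` at offset 257 of its first header block. A pre-ustar tar has no such marker, which would be consistent with the old `1.tar` not being identified by magic bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of NiFi or Tika.
final class MagicSniff {
    // ZIP local file header signature "PK\003\004" at offset 0.
    // (An empty archive starts with a different signature and is not covered here.)
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // ustar archives store the ASCII string "ustar" at offset 257 of the first block.
    private static final byte[] USTAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);
    private static final int USTAR_OFFSET = 257;

    private MagicSniff() {}

    // Returns true if `magic` appears in `data` starting exactly at `offset`.
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Cheap first pass over the leading bytes of a file; no archive reader is opened.
    // The type names returned here are assumptions made for this sketch.
    static String sniff(byte[] head) {
        if (matchesAt(head, 0, ZIP_MAGIC)) {
            return "application/zip";
        }
        if (matchesAt(head, USTAR_OFFSET, USTAR_MAGIC)) {
            return "application/x-tar";
        }
        return "application/octet-stream";
    }
}
```

Only the first 262 bytes of a file are needed for both checks, so a caller could read a small header buffer and fall back to deeper inspection (for example, telling a plain zip from an OOXML document) only when the zip signature matches.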
