[ 
https://issues.apache.org/jira/browse/NIFI-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481534#comment-15481534
 ] 

Andre commented on NIFI-2374:
-----------------------------

[~joewitt]

Not sure if we are on the same page, but this is truly just a version bump, with no 
added functionality, especially around metadata extraction via parsers.

1 - I am not sure if we need the parsers to be honest... If I understand Tika 
correctly, the core library does identification while the Parsers would allow 
us to extract metadata from the identified files.

I base this understanding on the following excerpt from the URL you linked:

bq. Please note that Apache Tika is able to detect a much wider range of 
formats than those listed below, this page only documents those formats from 
which Tika is able to extract metadata and/or textual content.

2 - The list is for parsers, not for the "file magic" performed by the 
[Detector|https://tika.apache.org/1.13/api/org/apache/tika/detect/Detector.html]
  that we call here: 

https://github.com/apache/nifi/blob/f987b216090f29719976ed1693be2ea358523aa5/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java#L134

I tried to find a better list but couldn't. :-(
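
To illustrate the split between magic-based detection and looking inside a container (the ZIP vs JAR case from the issue description), here is a minimal sketch using only the Java standard library. It is not Tika's code. The class and method names (DetectionSketch, detectByMagic, detectContainer, buildArchive) are hypothetical, and treating "contains META-INF/MANIFEST.MF" as "is a JAR" is a simplifying assumption on my side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; this is NOT how Tika is implemented.
public class DetectionSketch {

    // "File magic": inspect only the leading bytes. Every ZIP-based
    // container (plain zip, jar, ...) starts with the same local file
    // header signature "PK\3\4", so they all get the same answer here.
    static String detectByMagic(byte[] data) {
        if (data.length >= 4 && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Container inspection: open the archive and look at the entry names
    // to refine the magic-based guess. Assumption: a manifest entry is
    // enough to call the archive a JAR.
    static String detectContainer(byte[] data) throws IOException {
        String type = detectByMagic(data);
        if (!type.equals("application/zip")) {
            return type;
        }
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals("META-INF/MANIFEST.MF")) {
                    return "application/java-archive";
                }
            }
        }
        return type;
    }

    // Helper that builds a small in-memory archive with the given entry names.
    static byte[] buildArchive(String... entryNames) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            for (String name : entryNames) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("x".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = buildArchive("readme.txt");
        byte[] jar = buildArchive("META-INF/MANIFEST.MF", "A.class");

        System.out.println("magic, plain zip:     " + detectByMagic(plain));
        System.out.println("magic, jar:           " + detectByMagic(jar));
        System.out.println("container, plain zip: " + detectContainer(plain));
        System.out.println("container, jar:       " + detectContainer(jar));
    }
}
```

In this sketch the magic-only check cannot tell the two archives apart; only the second step, which reads the entries, can. Neither step extracts metadata or text, which is the part I would leave out of this ticket.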

3 - Very valid point... AFAIK there are no changes with regard to NIFI-2667 :-)




So just to emphasise again, my idea was only to bump the dependency version, 
without adding any additional Tika features. Let me know if you would like any 
extra action and I will be happy to address it.






> IdentifyMimeType documentation is misleading
> --------------------------------------------
>
>                 Key: NIFI-2374
>                 URL: https://issues.apache.org/jira/browse/NIFI-2374
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 1.0.0, 0.7.0
>            Reporter: Andre
>            Assignee: Andre
>            Priority: Minor
>             Fix For: 1.1.0
>
>
> The current documentation of IdentifyMimeType mentions the processor is 
> capable of identifying a reasonably small range of file types.
> However, upon inspecting the code, it becomes evident that the processor 
> employs Apache Tika detectors and parsers (required to distinguish a ZIP file 
> from a JAR).
> This means the list of File(MIME) types detected is the same as the one 
> present in Tika's DefaultDetector.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
