[ https://issues.apache.org/jira/browse/NIFI-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326604#comment-14326604 ]
ASF GitHub Bot commented on NIFI-296:
-------------------------------------
Github user adamonduty commented on a diff in the pull request:
https://github.com/apache/incubator-nifi/pull/27#discussion_r24943917
--- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
@@ -239,87 +128,39 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
}
final ProcessorLog logger = getLogger();
- final boolean identifyZip = context.getProperty(IDENTIFY_ZIP).asBoolean();
- final boolean identifyTar = context.getProperty(IDENTIFY_TAR).asBoolean();
final ObjectHolder<String> mimeTypeRef = new ObjectHolder<>(null);
+ final ObjectHolder<String> extensionRef = new ObjectHolder<>(null);
session.read(flowFile, new InputStreamCallback() {
@Override
public void process(final InputStream stream) throws IOException {
try (final InputStream in = new BufferedInputStream(stream)) {
- // read in up to magicHeaderMaxLength bytes
- in.mark(magicHeaderMaxLength);
- byte[] header = new byte[magicHeaderMaxLength];
- for (int i = 0; i < header.length; i++) {
- final int next = in.read();
- if (next >= 0) {
- header[i] = (byte) next;
- } else if (i == 0) {
- header = new byte[0];
- } else {
- final byte[] newBuffer = new byte[i - 1];
- System.arraycopy(header, 0, newBuffer, 0, i - 1);
- header = newBuffer;
- break;
- }
- }
- in.reset();
-
- for (final MagicHeader magicHeader : magicHeaders) {
- if (magicHeader.matches(header)) {
- mimeTypeRef.set(magicHeader.getMimeType());
- return;
- }
- }
-
- if (!identifyZip) {
- for (final MagicHeader magicHeader : zipMagicHeaders) {
- if (magicHeader.matches(header)) {
- mimeTypeRef.set(magicHeader.getMimeType());
- return;
- }
- }
- }
-
- if (!identifyTar) {
- for (final MagicHeader magicHeader : tarMagicHeaders) {
- if (magicHeader.matches(header)) {
- mimeTypeRef.set(magicHeader.getMimeType());
- return;
- }
- }
+ TikaInputStream tikaStream = TikaInputStream.get(in);
+ Metadata metadata = new Metadata();
+ // Get mime type
+ MediaType mediatype = detector.detect(tikaStream, metadata);
+ mimeTypeRef.set(mediatype.toString());
+ // Get common file extension
+ try {
+ MimeType mimetype;
+ mimetype = config.getMimeRepository().forName(mediatype.toString());
+ extensionRef.set(mimetype.getExtension());
+ } catch (MimeTypeException ex) {
+ logger.warn("MIME type detection failed: {}", new Object[]{ex.toString()});
--- End diff --
Didn't know that! I'll fix and re-push.
> Extend the capability of IdentifyMimeType and extract document metadata
> -----------------------------------------------------------------------
>
> Key: NIFI-296
> URL: https://issues.apache.org/jira/browse/NIFI-296
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Reporter: Joseph Witt
> Priority: Minor
>
> Apache Tika is pretty awesome and can handle a large range of document types.
> It could perhaps be used to extend the capability of IdentifyMimeType and it
> could also potentially be used to automatically extract document
> metadata/data as flow file attributes to be used for data flow routing
> decisions.