[jira] [Commented] (NIFI-296) Extend the capability of IdentifyMimeType and extract document metadata

ASF GitHub Bot (JIRA) Wed, 18 Feb 2015 13:54:08 -0800

    [ https://issues.apache.org/jira/browse/NIFI-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326604#comment-14326604 ]


ASF GitHub Bot commented on NIFI-296:
-------------------------------------

Github user adamonduty commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24943917
  
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -239,87 +128,39 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
             }
     
             final ProcessorLog logger = getLogger();
    -        final boolean identifyZip = context.getProperty(IDENTIFY_ZIP).asBoolean();
    -        final boolean identifyTar = context.getProperty(IDENTIFY_TAR).asBoolean();
     
             final ObjectHolder<String> mimeTypeRef = new ObjectHolder<>(null);
    +        final ObjectHolder<String> extensionRef = new ObjectHolder<>(null);
             session.read(flowFile, new InputStreamCallback() {
                 @Override
                 public void process(final InputStream stream) throws IOException {
                     try (final InputStream in = new BufferedInputStream(stream)) {
    -                    // read in up to magicHeaderMaxLength bytes
    -                    in.mark(magicHeaderMaxLength);
    -                    byte[] header = new byte[magicHeaderMaxLength];
    -                    for (int i = 0; i < header.length; i++) {
    -                        final int next = in.read();
    -                        if (next >= 0) {
    -                            header[i] = (byte) next;
    -                        } else if (i == 0) {
    -                            header = new byte[0];
    -                        } else {
    -                            final byte[] newBuffer = new byte[i - 1];
    -                            System.arraycopy(header, 0, newBuffer, 0, i - 1);
    -                            header = newBuffer;
    -                            break;
    -                        }
    -                    }
    -                    in.reset();
    -
    -                    for (final MagicHeader magicHeader : magicHeaders) {
    -                        if (magicHeader.matches(header)) {
    -                            mimeTypeRef.set(magicHeader.getMimeType());
    -                            return;
    -                        }
    -                    }
    -
    -                    if (!identifyZip) {
    -                        for (final MagicHeader magicHeader : zipMagicHeaders) {
    -                            if (magicHeader.matches(header)) {
    -                                mimeTypeRef.set(magicHeader.getMimeType());
    -                                return;
    -                            }
    -                        }
    -                    }
    -
    -                    if (!identifyTar) {
    -                        for (final MagicHeader magicHeader : tarMagicHeaders) {
    -                            if (magicHeader.matches(header)) {
    -                                mimeTypeRef.set(magicHeader.getMimeType());
    -                                return;
    -                            }
    -                        }
    +                    TikaInputStream tikaStream = TikaInputStream.get(in);
    +                    Metadata metadata = new Metadata();
    +                    // Get mime type
    +                    MediaType mediatype = detector.detect(tikaStream, metadata);
    +                    mimeTypeRef.set(mediatype.toString());
    +                    // Get common file extension
    +                    try {
    +                        MimeType mimetype;
    +                        mimetype = config.getMimeRepository().forName(mediatype.toString());
    +                        extensionRef.set(mimetype.getExtension());
    +                    } catch (MimeTypeException ex) {
    +                        logger.warn("MIME type detection failed: {}", new Object[]{ex.toString()});
    --- End diff --
    
    Didn't know that! I'll fix and re-push.


> Extend the capability of IdentifyMimeType and extract document metadata
> -----------------------------------------------------------------------
>
>                 Key: NIFI-296
>                 URL: https://issues.apache.org/jira/browse/NIFI-296
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Joseph Witt
>            Priority: Minor
>
> Apache Tika is pretty awesome and can handle a large range of document types.
> It could perhaps be used to extend the capability of IdentifyMimeType and it
> could also potentially be used to automatically extract document
> metadata/data as flow file attributes to be used for data flow routing
> decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
