Re: Text and metadata extraction processor

Dmitry Goldenberg Fri, 01 Apr 2016 08:24:17 -0700

Got it.

What's the typical JIRA ticket triage process like within NiFi?  I'm
curious as to how consensus is built around designs, ticket assignments,
and what goes into a release.


On Fri, Apr 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:

> As far as I know, the processors haven't made it into any release yet. If
> that is the case, then we could just remove those properties altogether,
> and it's easy.
>
> If they have already been released, then we would need to ensure that the
> processor
> is invalid on startup (it doesn't accept those as dynamic properties) and
> then we update
> the migration guide to explain how to obtain the same behavior.
>
> But either way, we can definitely remove the properties if it's determined
> that there is not
> a good enough reason to keep them in.
>
> -Mark
>
>
> > On Apr 1, 2016, at 10:10 AM, Dmitry Goldenberg <[email protected]>
> wrote:
> >
> > Hi Mark,
> >
> > That is a good point.  It also has crossed my mind.  AFAIK,
> > ExtractMediaAttributes already has a couple of similar filters on it; Joe
> > S., please correct me if I'm wrong.  I merely suggested that we extend
> > these filters.
> >
> > I'd have to agree with your points, Mark, that it's cleaner to keep the
> > conditionals separate, on RouteOnAttribute and the like.
> >
> > If that is the consensus then I believe we're back to the idea of a
> "mode"
> > configuration on ExtractMediaAttributes, with 3 values: a)
> > extractMetadataOnly, b) extractContentOnly, c) extractMetadataAndContent.
> > As an alternative we have also considered rolling 3 separate processors:
> > ExtractMetadata, ExtractContent, and ExtractMetadataAndContent.  Given
> that
> > ExtractMediaAttributes already exists, I think it may be easiest to roll
> > with the new "mode" config parameter.
> >
> > One question then is also, what to do with the filters that are already
> on
> > ExtractMediaAttributes - ?  Should they still be there?
> >
> > BTW, I've filed the following JIRA tickets related to the topics we've
> been
> > discussing:
> >
> > Extract metadata and text - NIFI-1717
> > <https://issues.apache.org/jira/browse/NIFI-1717>
> > PerformOCR - NIFI-1718 <https://issues.apache.org/jira/browse/NIFI-1718>
> > ProcessPDF - NIFI-1719 <https://issues.apache.org/jira/browse/NIFI-1719>
> >
> > I'll propagate more info into those as we discuss things more.
> >
> > Mark, could you take a look at NIFI-1716
> > <https://issues.apache.org/jira/browse/NIFI-1716>.  This is a separate
> > topic so we could create a separate discussion thread for the CSV
> splitter.
> >
> > Thanks,
> > - Dmitry
> >
> >
> > On Fri, Apr 1, 2016 at 9:06 AM, Mark Payne <[email protected]> wrote:
> >
> >> Dmitry,
> >>
> >> I would be a bit concerned about providing options for filters that
> >> include and
> >> exclude certain things. I believe that if you send a FlowFile to the
> >> Processor,
> >> then the Processor should do its thing. If you want to filter out which
> >> FlowFiles
> >> have their content extracted, for example, I would suggest using a
> >> Processor
> >> like RouteOnAttribute to ensure that only the appropriate FlowFiles are
> >> processed
> >> by the ExtractMediaMetadata processor.
> >>
> >> This allows the metadata extraction processor to focus purely on
> extracting
> >> metadata and doesn't have to deal with all of the logic of filtering
> >> things out. The logic
> >> for filtering things out is almost guaranteed to grow much more complex
> as
> >> people
> >> start to use this more and more. NiFi already provides several
> route-based
> >> processors
> >> to allow for a great deal of flexibility with this type of logic
> >> (RouteOnAttribute, RouteOnContent,
> >> ScanAttribute, ScanContent, etc.).
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >>
> >>> On Apr 1, 2016, at 12:55 AM, Dmitry Goldenberg <
> [email protected]>
> >> wrote:
> >>>
> >>> Simon,
> >>>
> >>> I believe we've moved on past the 'mode' option and have now switched to
> >>> talking about how the include/exclude filters, for metadata and content
> >>> on the one hand, and filename or MIME type based on the other, would
> >>> drive whether metadata, content, or both would get extracted.
> >>>
> >>> For example, a user could configure the ExtractMediaAttributes
> processor
> >> to
> >>> extract metadata for all image files (but not content), extract content
> >>> only for plain text documents (but no metadata), or both meta and
> content
> >>> for documents with an extension ".pqr", based on the filename.
> >>>
> >>> Could you elaborate on your vision of how relationships could "drive"
> >> this
> >>> type of functionality?  Joe has already built some of the filtering
> into
> >>> the processor; I just suggested to extend that further, and we get all
> >> the
> >>> bases covered.
> >>>
> >>> I'm not sure I followed your comment on the extracted content being
> >>> transferred into a new FlowFile.  My thoughts were that the extracted
> >>> content would be inserted into a new, dedicated field, called for
> >> example,
> >>> "text", on *the same* FlowFile.  I imagine that for a lot of use-cases,
> >>> especially data ingestion into a search engine, the extracted
> attributes
> >>> *and* the extracted text must travel together as part of the ingested
> >>> document, with the original flowfile-content most likely getting
> dropped
> >> on
> >>> the way into the index.
> >>>
> >>> I guess an alternative could be to have an option to represent the
> >>> extraction results as a new document, and an option to drop the
> original,
> >>> and an option to copy the original's attributes onto the new doc. Seems
> >>> rather complex.  I like the "in-place" extraction.
> >>>
> >>> Could you also elaborate on how a controller service would handle OCR?
> >>> When a document floats into ExtractMediaAttributes, assuming Tesseract
> is
> >>> installed properly, Tika will already automatically fire off OCR.
> Unless
> >>> we turn that off and cause OCR to only be supported via this service.
> >> I'm
> >>> tempted to say why don't we just let Tika do its job for all cases, OCR
> >>> included.  Caveat being that OCR is expensive and it would be nice to
> >> have
> >>> ways of ensuring it has enough resources and doesn't bog the flow down.
> >>>
> >>> For the PDF processor, I'm thinking, yes, PDFBox to break it up into
> >> pages
> >>> and then apply Tika page by page, then aggregate the output together,
> >> with
> >>> a configurable max of up to N pages per document to process (due to how
> >>> slow OCR is).  I already have a prototype of this going, I'll file a
> JIRA
> >>> ticket for this feature.
> >>>
> >>> - Dmitry
> >>>
> >>>
> >>>
> >>> On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball <[email protected]>
> >> wrote:
> >>>
> >>>> What I’m suggesting is a single processor for both, but instead of using a
> >>>> mode property to determine which bits get extracted, you use the state of
> >>>> the relationships on the processor to configure which options Tika uses,
> >>>> and use a single pass to parse metadata into attributes and content into
> >>>> a new flow file that is transferred to the parsed relationship.
> >>>>
> >>>> On the tesseract front, it may make sense to do this through a
> >> controller
> >>>> service.
> >>>>
> >>>> A PDF processor might be interesting. Are you thinking of something
> like
> >>>> PDFBox, or tika again?
> >>>>
> >>>> Simon
> >>>>
> >>>>
> >>>>> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <[email protected]
> >
> >>>> wrote:
> >>>>>
> >>>>> Simon,
> >>>>>
> >>>>> Interesting commentary.  The issue that Joe and I have both looked
> at,
> >>>> with
> >>>>> the splitting of metadata and content extraction, is that if they're
> >>>> split
> >>>>> then the underlying Tika extraction has to process the file twice:
> once
> >>>> to
> >>>>> pull out the attributes and once to pull out the content.  Perhaps it
> >> may
> >>>>> be good to add ExtractMetadata and ExtractTextContent in addition to
> >>>>> ExtractMediaAttributes - ? Seems kind of an overkill but I may be
> >> wrong.
> >>>>>
> >>>>> It seems prudent to provide one wholesome, out-of-the-box extractor
> >>>>> processor with options to extract just metadata, just content, or
> both
> >>>>> metadata and content.
> >>>>>
> >>>>> I think what I'm hearing is that we need to allow for checking
> >> somewhere
> >>>>> for whether text/content has already been extracted by the time we
> get
> >> to
> >>>>> the ExtractMediaAttributes processor - ?  If that is the issue then I
> >>>>> believe the user would use RouteOnAttribute and if the content is
> >> already
> >>>>> filled in then they'd not route to ExtractMediaAttributes.
> >>>>>
> >>>>> As far as the OCR.  Tika internally supports OCR by directing image
> >> files
> >>>>> to Tesseract (if Tesseract is installed and configured properly).
> >> We've
> >>>>> started talking about how this could be reconciled in the
> >>>>> ExtractMediaAttributes.
> >>>>>
> >>>>> I think that once we have the basic ExtractMediaAttributes, we could
> >> add
> >>>>> filters for what files to enable the OCR on, and we'd need to expose
> a
> >>>> few
> >>>>> config parameters specific to OCR, such as e.g. the location of the
> >>>>> Tesseract installation and the maximum file size on which to attempt
> >> the
> >>>>> OCR.  Perhaps there can also be a RunOCR processor which would be
> >>>> dedicated
> >>>>> to running OCR.  But since Tika already has OCR integrated we'd
> >> probably
> >>>>> want to take care of that in the ExtractMediaAttributes
> configuration.
> >>>>>
> >>>>> Additionally, I've proposed the idea of a ProcessPDF processor which
> >>>> would
> >>>>> ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would
> >>>> break
> >>>>> it up into pages and run OCR on each page, then aggregate the
> extracted
> >>>>> text.
> >>>>>
> >>>>> - Dmitry
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>>> Just a thought…
> >>>>>>
> >>>>>> To keep consistent with other NiFi Parse patterns, would it make sense to
> >>>>>> base the extraction of content on the presence of a relationship?  So your
> >>>>>> Tika processor would have an original relationship, which would have
> >>>>>> metadata attached as attributes, and an extracted relationship, which
> >>>>>> would have the metadata and the processed content (text from an OCRed
> >>>>>> image, for example).  That way you can just use
> >>>>>> context.hasConnection(relationship) to determine whether to enable the
> >>>>>> Tika content processing.
> >>>>>>
> >>>>>> This seems more idiomatic than a mode flag.
> >>>>>>
> >>>>>> Simon
> >>>>>>
> >>>>>>> On 31 Mar 2016, at 19:48, Joe Skora <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Dmitry,
> >>>>>>>
> >>>>>>> I think we're good.  I was confused because "XXX_METADATA MIMETYPE
> >>>>>> FILTER"
> >>>>>>> entries referred to some MIME type of the metadata, but you meant
> to
> >>>> use
> >>>>>>> the file's MIME type to select what files have metadata extracted.
> >>>>>>>
> >>>>>>> Sorry, about that, I think we are on the same page.
> >>>>>>>
> >>>>>>> Joe
> >>>>>>>
> >>>>>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Joe,
> >>>>>>>>
> >>>>>>>> I think if we have the filters in place then there's no need for
> the
> >>>>>> 'mode'
> >>>>>>>> enum, as the filters themselves guide the processor in deciding
> >>>> whether
> >>>>>>>> metadata and/or content is extracted for a given input file.
> >>>>>>>>
> >>>>>>>> Agreed on the handling of archives as a separate processor
> >> (template,
> >>>>>> seems
> >>>>>>>> like).
> >>>>>>>>
> >>>>>>>> I think it's easiest to do both metadata and/or content in one
> >>>> processor
> >>>>>>>> since it can tell Tika whether to extract metadata and/or content,
> >> in
> >>>>>> one
> >>>>>>>> pass over the file bytes (as you pointed out).
> >>>>>>>>
> >>>>>>>> Agreed on the exclusions trumping inclusions; I think that makes
> >>>> sense.
> >>>>>>>>
> >>>>>>>>>> We will only have a mimetype for the original flow file itself
> so
> >>>> I'm
> >>>>>>>> not sure about the metadata mimetype filter.
> >>>>>>>>
> >>>>>>>> I'm not sure where there might be an issue here. The metadata MIME
> >>>> type
> >>>>>>>> filter tells the processor for which MIME types to perform the
> >>>> metadata
> >>>>>>>> extraction.  For instance, extract metadata for images and videos,
> >>>> only.
> >>>>>>>> This could possibly be coupled with an exclusion filter for
> content
> >>>> that
> >>>>>>>> says, don't try to extract content from images and videos.
> >>>>>>>>
> >>>>>>>> I think with the six filters we get all the bases covered:
> >>>>>>>>
> >>>>>>>> 1. include metadata? --
> >>>>>>>>   1. yes --
> >>>>>>>>      1. determine the inclusion of metadata by filename pattern
> >>>>>>>>      2. determine the inclusion of metadata by MIME type pattern
> >>>>>>>>   2. no --
> >>>>>>>>      1. determine the exclusion of metadata by filename pattern
> >>>>>>>>      2. determine the exclusion of metadata by MIME type pattern
> >>>>>>>> 2. include content? --
> >>>>>>>>   1. yes --
> >>>>>>>>      1. determine the inclusion of content by filename pattern
> >>>>>>>>      2. determine the inclusion of content by MIME type pattern
> >>>>>>>>   2. no --
> >>>>>>>>      1. determine the exclusion of content by filename pattern
> >>>>>>>>      2. determine the exclusion of content by MIME type pattern
> >>>>>>>>
> >>>>>>>> Does this work?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> - Dmitry
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <[email protected]>
> >> wrote:
> >>>>>>>>
> >>>>>>>>> Dmitry,
> >>>>>>>>>
> >>>>>>>>> Looking at this and your prior email.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 1. I can see "extract metadata only" being as popular as "extract
> >>>>>>>>> metadata and content".  It will all depend on the type of media,
> >> for
> >>>>>>>>> audio/video files adding the metadata to the flow file is enough
> >> but
> >>>>>>>> for
> >>>>>>>>> Word, PDF, etc. files the content may be wanted as well.
> >>>>>>>>> 2. After thinking about it, I agree on an enum for mode.
> >>>>>>>>> 3. I think any handling of zips or archive files should be
> handled
> >>>> by
> >>>>>>>>> another processor, that keeps this processor cleaner and improves
> >>>> its
> >>>>>>>>> ability for re-use.
> >>>>>>>>> 4. I like the addition of exclude filters but I'm not sure about adding
> >>>>>>>>> content filters.  We will only have a mimetype for the original flow
> >>>>>>>>> file itself so I'm not sure about the metadata mimetype filter.  I
> >>>>>>>>> think content filtering may be best left for another downstream
> >>>>>>>>> processor, but it might run faster if included here since the entire
> >>>>>>>>> content will be handled during extraction.  If the content filters are
> >>>>>>>>> implemented, for performance they need to short circuit so that if the
> >>>>>>>>> property is not set or is set to ".*" they don't evaluate the regex.
> >>>>>>>>>    1. FILENAME_FILTER - selects flow files to process based on
> >>>>>>>>>       filename matching regex. (exists)
> >>>>>>>>>    2. MIMETYPE_FILTER - selects flow files to process based on
> >>>>>>>>>       mimetype matching regex. (exists)
> >>>>>>>>>    3. FILENAME_EXCLUDE - excludes already selected flow files from
> >>>>>>>>>       processing based on filename matching regex. (new)
> >>>>>>>>>    4. MIMETYPE_EXCLUDE - excludes already selected flow files from
> >>>>>>>>>       processing based on mimetype matching regex. (new)
> >>>>>>>>>    5. CONTENT_FILTER (optional) - selects flow files for output based
> >>>>>>>>>       on extracted content matching regex. (new)
> >>>>>>>>>    6. CONTENT_EXCLUDE (optional) - excludes flow files from output
> >>>>>>>>>       based on extracted content matching regex. (new)
> >>>>>>>>> 5. As indicated in the descriptions in #4, I don't think
> >> overlapping
> >>>>>>>>> filters are an error, instead excludes should take precedence
> over
> >>>>>>>>> includes.  Then I can include a domain (like A*) but exclude
> >>>> sub-sets
> >>>>>>>>> (like
> >>>>>>>>> AXYZ*).
> >>>>>>>>>
> >>>>>>>>> I'm sure there's something we missed, but I think that covers
> most
> >> of
> >>>>>> it.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Joe
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
> >>>>>>>>> [email protected]
> >>>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Joe,
> >>>>>>>>>>
> >>>>>>>>>> Upon some thinking, I've started wondering whether all the cases
> >> can
> >>>>>> be
> >>>>>>>>>> covered by the following filters:
> >>>>>>>>>>
> >>>>>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>>>> files get their content extracted, by file name
> >>>>>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for
> which
> >>>>>> input
> >>>>>>>>>> files get their metadata extracted, by file name
> >>>>>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>>>> files get their content extracted, by MIME type
> >>>>>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for
> which
> >>>>>> input
> >>>>>>>>>> files get their metadata extracted, by MIME type
> >>>>>>>>>>
> >>>>>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>>>> files do NOT get their content extracted, by file name
> >>>>>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for
> which
> >>>>>> input
> >>>>>>>>>> files do NOT get their metadata extracted, by file name
> >>>>>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which
> >>>> input
> >>>>>>>>>> files do NOT get their content extracted, by MIME type
> >>>>>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for
> which
> >>>>>> input
> >>>>>>>>>> files do NOT get their metadata extracted, by MIME type
> >>>>>>>>>>
> >>>>>>>>>> I believe this gets all the bases covered. At processor init
> time,
> >>>> we
> >>>>>>>> can
> >>>>>>>>>> analyze the inclusions vs. exclusions; any overlap would cause a
> >>>>>>>>>> configuration error.
> >>>>>>>>>>
> >>>>>>>>>> Let me know what you think, thanks.
> >>>>>>>>>> - Dmitry
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
> >>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Joe,
> >>>>>>>>>>>
> >>>>>>>>>>> I follow your reasoning on the semantics of "media".  One might
> >>>> argue
> >>>>>>>>>> that
> >>>>>>>>>>> media files are a case of "document" or that a document is a
> case
> >>>> of
> >>>>>>>>>>> "media".
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not proposing filters for the mode of processing, I'm
> >>>> proposing a
> >>>>>>>>>>> flag/enum with 3 values:
> >>>>>>>>>>>
> >>>>>>>>>>> A) extract metadata only;
> >>>>>>>>>>> B) extract content only and place it into the flowfile content;
> >>>>>>>>>>> C) extract both metadata and content.
> >>>>>>>>>>>
> >>>>>>>>>>> I think the default should be C, to extract both.  At least in
> my
> >>>>>>>>>>> experience most flows I've dealt with were interested in
> >> extracting
> >>>>>>>>> both.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't see how this mode would benefit from being expression
> >>>> driven
> >>>>>>>> -
> >>>>>>>>> ?
> >>>>>>>>>>>
> >>>>>>>>>>> I think we can add this enum mode and have the basic use case
> >>>>>>>> covered.
> >>>>>>>>>>>
> >>>>>>>>>>> Additionally, further down the line, I was thinking we could
> >> ponder
> >>>>>>>> the
> >>>>>>>>>>> following (these have been essential in search engine
> ingestion):
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Extraction from compressed files/archives. How would
> >>>>>>>>> UnpackContent
> >>>>>>>>>>> work with ExtractMediaAttributes? Use-case being, we've got a
> zip
> >>>>>>>>>> file as
> >>>>>>>>>>> input and want to crack it open and unravel it recursively; it
> >> may
> >>>>>>>>>> have
> >>>>>>>>>>> other, nested zips inside, along with other documents. One way
> to
> >>>>>>>>>> handle
> >>>>>>>>>>> this is to treat the whole archive as one document and merge
> all
> >>>>>>>>>> attributes
> >>>>>>>>>>> into one FlowFile.  The other way would be to treat each
> archive
> >>>>>>>>>> entry as
> >>>>>>>>>>> its own flow file and keep a pointer back at the parent
> archive.
> >>>>>>>>> Yet
> >>>>>>>>>>> another case is when the user might want to only extract the
> >>>>>>>> 'leaf'
> >>>>>>>>>> entries
> >>>>>>>>>>> and discard any parent container archives.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Attachments and embeddings. Users may want to treat any
> >>>>>>>> attached
> >>>>>>>>> or
> >>>>>>>>>>> embedded files as separate flowfiles with perhaps pointers back
> >> to
> >>>>>>>>> the
> >>>>>>>>>>> parent files. This definitely warrants a filter. Oftentimes
> >> Office
> >>>>>>>>>>> documents have 'media' embeddings which are often not of
> >> interest,
> >>>>>>>>>>> especially for the case of ingesting into a search engine.
> >>>>>>>>>>>
> >>>>>>>>>>> 3. PDF. For PDF's, we can do OCR. This is important for the
> >>>>>>>>>>> 'image'/scanned PDF's for which Tika won't extract text.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to understand how much of this is already supported in
> >>>> NiFi
> >>>>>>>>> and
> >>>>>>>>>>> if not I'd volunteer/collaborate to implement some of this.
> >>>>>>>>>>>
> >>>>>>>>>>> - Dmitry
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]>
> >>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Are you proposing separate filters that determine the mode of
> >>>>>>>>>>>> processing, metadata/content/metadataAndContent?  I was thinking of
> >>>>>>>>>>>> one set of selection filters and a static mode switch at the
> >>>>>>>>>>>> processor instance level, to make configuration more obvious such
> >>>>>>>>>>>> that one instance of the processor will handle a known set of files
> >>>>>>>>>>>> regardless of the processing mode.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I was thinking it would be useful for the mode switch to
> support
> >>>>>>>>>>>> expression
> >>>>>>>>>>>> language, but I'm not sure about that since the selection
> >> filters
> >>>>>>>> will
> >>>>>>>>>>>> control what files get processed and it would be harder to
> >>>> configure
> >>>>>>>>> if
> >>>>>>>>>>>> the
> >>>>>>>>>>>> output flow file could vary between source format and
> extracted
> >>>>>>>> text.
> >>>>>>>>>> So,
> >>>>>>>>>>>> while it might be easy to do, and occasionally useful, I think
> >> in
> >>>>>>>>> normal
> >>>>>>>>>>>> use I'd never have a varying mode but would more likely have
> >>>>>>>> multiple
> >>>>>>>>>>>> processor instances with some routing or selection going on
> >>>> further
> >>>>>>>>>>>> upstream.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wrestled with the naming issue too.  I went with
> >>>>>>>>>>>> "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it
> >>>>>>>>>>>> seemed to represent the broader context better.  In reality, media
> >>>>>>>>>>>> files are documents and documents are media files, but in the end
> >>>>>>>>>>>> it's all just semantics.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't think I would change the NAR bundle name, because I
> >> think
> >>>>>>>>>>>> "nifi-media-nar" establishes it as a place to collect this and
> >>>> other
> >>>>>>>>>> media
> >>>>>>>>>>>> related processors in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Joe
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
> >>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Joe,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for all the details.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I wanted to propose that I do some of this work so as to go
> >>>>>>>> through
> >>>>>>>>>> the
> >>>>>>>>>>>>> full cycle of developing a processor and committing it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Once your changes are merged, I could extend your
> >>>>>>>>>> 'ExtractMediaMetadata'
> >>>>>>>>>>>>> processor to handle the content, in addition to the metadata.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but
> add a
> >>>>>>>> mode
> >>>>>>>>>>>> with 3
> >>>>>>>>>>>>> values: metadataOnly, contentOnly, metadataAndContent.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing that looks to be a design issue right now is, your
> >>>>>>>> changes
> >>>>>>>>>> and
> >>>>>>>>>>>>> the 'nomenclature' seem media-oriented ("nifi-media-nar"
> etc.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Would it make sense to have a generic processor
> >>>>>>>>>>>>> ExtractDocumentMetadataAndContent?  Are there enough
> specifics
> >> in
> >>>>>>>>> the
> >>>>>>>>>>>>> image/video processing stuff to warrant that to be a separate
> >>>>>>>> layer;
> >>>>>>>>>>>>> perhaps a subclass of ExtractDocumentMetadataAndContent ?
> >> Might
> >>>>>>>> it
> >>>>>>>>>> make
> >>>>>>>>>>>>> sense to rename nifi-media-nar into nifi-text-extract-nar ?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]
> >
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yeah, I agree, Tika is pretty impressive.  The original ticket,
> >>>>>>>>>>>>>> NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted
> >>>>>>>>>>>>>> extraction of metadata from WAV files, but as I got into it I found
> >>>>>>>>>>>>>> Tika, so for the same effort it supports the 1,000+ file formats
> >>>>>>>>>>>>>> Tika understands.  That new processor is called
> >>>>>>>>>>>>>> "ExtractMediaMetadata"; you can pull PR-252
> >>>>>>>>>>>>>> <https://github.com/apache/nifi/pull/252> from GitHub if you want to
> >>>>>>>>>>>>>> give it a try before it's merged.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Extracting content for those 1,000+ formats would be a valuable
> >>>>>>>>>>>>>> addition.
> >>>>>>>>>>>>>> I see two possible approaches, 1) create a new
> >>>>>>>>> "ExtractMediaContent"
> >>>>>>>>>>>>>> processor that would put the document content in a new flow
> >>>>>>>> file,
> >>>>>>>>>> and
> >>>>>>>>>>>> 2)
> >>>>>>>>>>>>>> extend the new "ExtractMediaMetadata" processor so it can
> >>>>>>>> extract
> >>>>>>>>>>>>> metadata,
> >>>>>>>>>>>>>> content, or both.  One combined processor makes sense if it
> >> can
> >>>>>>>>>>>> provide a
> >>>>>>>>>>>>>> performance gain, otherwise two complementary processors may
> >>>>>>>> make
> >>>>>>>>>>>> usage
> >>>>>>>>>>>>>> easier.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm glad to help if you want to take a cut at the processor
> >>>>>>>>>> yourself,
> >>>>>>>>>>>> or
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>> can take a crack at it myself if you'd prefer.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Don't hesitate to ask questions or share comments and
> feedback
> >>>>>>>>>>>> regarding
> >>>>>>>>>>>>>> the ExtractMediaMetadata processor or the addition of
> content
> >>>>>>>>>>>> handling.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Joe Skora
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> >>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks, Joe!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and
> >> contributing.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> While building search-related ingestion systems, I've seen
> >>>>>>>>>> metadata
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>> text extraction being done all the time; it's always there
> >> and
> >>>>>>>>>>>> always
> >>>>>>>>>>>>> has
> >>>>>>>>>>>>>>> to be done for building search indexes.  Beyond that,
> >>>>>>>>> OCR-related
> >>>>>>>>>>>>>>> capabilities are often requested, and the advantage of Tika
> >> is
> >>>>>>>>>> that
> >>>>>>>>>>>> it
> >>>>>>>>>>>>>>> supports OCR out of the box.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <
> >>>>>>>> [email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding
> >>>>>>>> for
> >>>>>>>>>>>>>>>> extracting metadata from media files using Tika.  Perhaps
> it
> >>>>>>>>>> makes
> >>>>>>>>>>>>>>>> sense to broaden that to in general extract what Tika can
> >>>>>>>>> find.
> >>>>>>>>>>>> Joe
> >>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>> perhaps you can discuss your ideas with Dmitry and see if
> >>>>>>>>>>>> broadening
> >>>>>>>>>>>>>>>> is a good idea or if rather domain specific ones make more
> >>>>>>>>>> sense.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This concept of extracting metadata from documents/text
> >>>>>>>> files,
> >>>>>>>>>>>> etc..
> >>>>>>>>>>>>>>>> using something like Tika is certainly useful as that then
> >>>>>>>> can
> >>>>>>>>>>>> drive
> >>>>>>>>>>>>>>>> nice automated routing decisions.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>> Joe
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> >>>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I see that the ExtractText processor extracts text using
> >>>>>>>>>> regex.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What about a processor that extracts text and metadata
> >>>>>>>> from
> >>>>>>>>>>>>> incoming
> >>>>>>>>>>>>>>>>> files?  That doesn't seem to exist - but perhaps I didn't
> >>>>>>>>>> quite
> >>>>>>>>>>>>> look
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> right spots.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If that doesn't exist I'd like to implement and commit
> it,
> >>>>>>>>>> using
> >>>>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>>>> Tika.  There may also be a couple of related processors
> to
> >>>>>>>>>> that.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thoughts?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
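
On the filter semantics discussed in the thread above: Joe's suggestion is that
excludes take precedence over includes, and that a filter which is not set or
is set to ".*" should short circuit without evaluating a regex. A minimal Java
sketch of that rule follows. The class name IncludeExcludeFilter and its
methods are hypothetical; they are not part of NiFi or of PR-252. The thread's
glob-like "A*" / "AXYZ*" example is assumed to correspond to the regexes
"A.*" / "AXYZ.*".

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not a NiFi class. */
final class IncludeExcludeFilter {
    private final Pattern include; // null means "accept everything"
    private final Pattern exclude; // null means "exclude nothing"

    IncludeExcludeFilter(String includeRegex, String excludeRegex) {
        // Short circuit: an unset or ".*" include is never compiled or evaluated.
        this.include = (includeRegex == null || includeRegex.isEmpty()
                || ".*".equals(includeRegex)) ? null : Pattern.compile(includeRegex);
        // An unset exclude removes nothing, so it is skipped as well.
        this.exclude = (excludeRegex == null || excludeRegex.isEmpty())
                ? null : Pattern.compile(excludeRegex);
    }

    /** Returns true if the value (a filename or MIME type) should be processed. */
    boolean accepts(String value) {
        // Excludes are checked first, so they win over any matching include.
        if (exclude != null && exclude.matcher(value).matches()) {
            return false;
        }
        return include == null || include.matcher(value).matches();
    }
}
```

With new IncludeExcludeFilter("A.*", "AXYZ.*"), a name starting with "A" is
accepted unless it also starts with "AXYZ", which is the "include a domain but
exclude sub-sets" case from the thread.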

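The three-valued "mode" setting proposed in the thread (extractMetadataOnly,
extractContentOnly, extractMetadataAndContent, defaulting to both) could be
modelled roughly as below. This is only a sketch: the enum name ExtractionMode,
its methods, and the choice to reject unknown values with an exception are
assumptions made for illustration, and nothing here is taken from the NiFi
code base.

```java
/** Hypothetical enum mirroring the three "mode" values named in the thread. */
enum ExtractionMode {
    METADATA_ONLY(true, false),
    CONTENT_ONLY(false, true),
    METADATA_AND_CONTENT(true, true);

    private final boolean metadata;
    private final boolean content;

    ExtractionMode(boolean metadata, boolean content) {
        this.metadata = metadata;
        this.content = content;
    }

    boolean extractsMetadata() {
        return metadata;
    }

    boolean extractsContent() {
        return content;
    }

    /** Parses a property value; an unset value falls back to extracting both. */
    static ExtractionMode fromProperty(String value) {
        if (value == null || value.trim().isEmpty()) {
            return METADATA_AND_CONTENT;
        }
        switch (value.trim()) {
            case "extractMetadataOnly":
                return METADATA_ONLY;
            case "extractContentOnly":
                return CONTENT_ONLY;
            case "extractMetadataAndContent":
                return METADATA_AND_CONTENT;
            default:
                throw new IllegalArgumentException("Unknown mode: " + value);
        }
    }
}
```

A processor could consult extractsMetadata() and extractsContent() once per
flow file, so that a single parsing pass produces only what the mode asks for.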