Re: Text and metadata extraction processor

Joe Skora Thu, 31 Mar 2016 06:28:23 -0700

Dmitry,

Looking at this and your prior email.



   1. I can see "extract metadata only" being as popular as "extract
   metadata and content".  It will all depend on the type of media: for
   audio/video files, adding the metadata to the flow file is enough, but for
   Word, PDF, etc. files the content may be wanted as well.
   2. After thinking about it, I agree on an enum for mode.
   3. I think zips or archive files should be handled by another
   processor; that keeps this processor cleaner and makes it easier to
   reuse.
   4. I like the addition of exclude filters but I'm not sure about adding
   content filters.  We will only have a mimetype for the original flow file
   itself, so I'm not sure about the metadata mimetype filter.  I think content
   filtering may be best left for another downstream processor, but it might
   run faster if included here since the entire content will be handled
   during extraction.  If the content filters are implemented, for performance
   they need to short circuit so that if the property is not set or is set to
   ".*" they don't evaluate the regex.  The filter properties would then be:
      1. FILENAME_FILTER - selects flow files to process based on filename
      matching regex. (exists)
      2. MIMETYPE_FILTER - selects flow files to process based on mimetype
      matching regex. (exists)
      3. FILENAME_EXCLUDE - excludes already selected flow files from
      processing based on filename matching regex. (new)
      4. MIMETYPE_EXCLUDE - excludes already selected flow files from
      processing based on mimetype matching regex. (new)
      5. CONTENT_FILTER (optional) - selects flow files for output based on
      extracted content matching regex. (new)
      6. CONTENT_EXCLUDE (optional) - excludes flow files from output based
      on extracted content matching regex. (new)
   5. As indicated in the descriptions in #4, I don't think overlapping
   filters are an error; instead, excludes should take precedence over
   includes.  Then I can include a domain (like A*) but exclude sub-sets (like
   AXYZ*).
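
As a rough illustration of points 4 and 5, here is a minimal sketch in
Python (not the processor's actual Java code).  The property names come from
the list above; the helper names is_selected and should_process are
hypothetical, and full-string regex matching is an assumption.  It shows
excludes taking precedence over includes, and the short circuit that skips
the regex when an include filter is unset or set to ".*":

```python
import re
from typing import Mapping, Optional

MATCH_ALL = ".*"


def _is_active(pattern: Optional[str]) -> bool:
    """An include filter only needs evaluating if it is set and not '.*'."""
    return bool(pattern) and pattern != MATCH_ALL


def is_selected(value: str, include: Optional[str], exclude: Optional[str]) -> bool:
    """Decide whether a value (filename, mimetype, ...) passes one filter pair."""
    # Excludes take precedence: an exclude match rejects the value even if
    # the include pattern would also match it.
    if exclude and re.fullmatch(exclude, value):
        return False
    # Short circuit: an unset or ".*" include accepts everything without
    # evaluating a regex.
    if not _is_active(include):
        return True
    return re.fullmatch(include, value) is not None


def should_process(filename: str, mime_type: str, props: Mapping[str, str]) -> bool:
    """Apply the filename and mimetype filter pairs; both must accept."""
    return (
        is_selected(filename, props.get("FILENAME_FILTER"), props.get("FILENAME_EXCLUDE"))
        and is_selected(mime_type, props.get("MIMETYPE_FILTER"), props.get("MIMETYPE_EXCLUDE"))
    )
```

With FILENAME_FILTER set to "A.*" and FILENAME_EXCLUDE set to "AXYZ.*", a
file named ABC.doc would be processed while AXYZ1.doc would be skipped.  The
same is_selected check could be reused for CONTENT_FILTER and
CONTENT_EXCLUDE against the extracted text.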

I'm sure there's something we missed, but I think that covers most of it.

Regards,
Joe


On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <[email protected]
> wrote:

> Joe,
>
> Upon some thinking, I've started wondering whether all the cases can be
> covered by the following filters:
>
> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> files get their content extracted, by file name
> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
> files get their metadata extracted, by file name
> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> files get their content extracted, by MIME type
> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
> files get their metadata extracted, by MIME type
>
> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> files do NOT get their content extracted, by file name
> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
> files do NOT get their metadata extracted, by file name
> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> files do NOT get their content extracted, by MIME type
> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
> files do NOT get their metadata extracted, by MIME type
>
> I believe this gets all the bases covered. At processor init time, we can
> analyze the inclusions vs. exclusions; any overlap would cause a
> configuration error.
>
> Let me know what you think, thanks.
> - Dmitry
>
> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
> [email protected]> wrote:
>
> > Hi Joe,
> >
> > I follow your reasoning on the semantics of "media".  One might argue
> > that media files are a case of "document" or that a document is a case
> > of "media".
> >
> > I'm not proposing filters for the mode of processing, I'm proposing a
> > flag/enum with 3 values:
> >
> > A) extract metadata only;
> > B) extract content only and place it into the flowfile content;
> > C) extract both metadata and content.
> >
> > I think the default should be C, to extract both.  At least in my
> > experience most flows I've dealt with were interested in extracting both.
> >
> > I don't see how this mode would benefit from being expression driven - ?
> >
> > I think we can add this enum mode and have the basic use case covered.
> >
> > Additionally, further down the line, I was thinking we could ponder the
> > following (these have been essential in search engine ingestion):
> >
> >    1. Extraction from compressed files/archives. How would UnpackContent
> >    work with ExtractMediaAttributes? Use-case being, we've got a zip
> >    file as input and want to crack it open and unravel it recursively;
> >    it may have other, nested zips inside, along with other documents.
> >    One way to handle this is to treat the whole archive as one document
> >    and merge all attributes into one FlowFile.  The other way would be
> >    to treat each archive entry as its own flow file and keep a pointer
> >    back to the parent archive.  Yet another case is when the user might
> >    want to only extract the 'leaf' entries and discard any parent
> >    container archives.
> >
> >    2. Attachments and embeddings. Users may want to treat any attached or
> >    embedded files as separate flowfiles with perhaps pointers back to the
> >    parent files. This definitely warrants a filter. Oftentimes Office
> >    documents have 'media' embeddings which are often not of interest,
> >    especially for the case of ingesting into a search engine.
> >
> >    3. PDF. For PDF's, we can do OCR. This is important for the
> >    'image'/scanned PDF's for which Tika won't extract text.
> >
> > I'd like to understand how much of this is already supported in NiFi and
> > if not I'd volunteer/collaborate to implement some of this.
> >
> > - Dmitry
> >
> >
> > On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
> >
> >> Dmitry,
> >>
> >> Are you proposing separate filters that determine the mode of
> >> processing, metadata/content/metadataAndContent?  I was thinking of one
> >> set of selection filters and a static mode switch at the processor
> >> instance level, to make configuration more obvious such that one
> >> instance of the processor will handle a known set of files regardless
> >> of the processing mode.
> >>
> >> I was thinking it would be useful for the mode switch to support
> >> expression language, but I'm not sure about that since the selection
> >> filters will control what files get processed and it would be harder to
> >> configure if the output flow file could vary between source format and
> >> extracted text.  So, while it might be easy to do, and occasionally
> >> useful, I think in normal use I'd never have a varying mode but would
> >> more likely have multiple processor instances with some routing or
> >> selection going on further upstream.
> >>
> >> I wrestled with the naming issue too.  I went with
> >> "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it
> >> seemed to represent the broader context better.  In reality, media
> >> files are documents and documents are media files, but in the end it's
> >> all just semantics.
> >>
> >> I don't think I would change the NAR bundle name, because I think
> >> "nifi-media-nar" establishes it as a place to collect this and other
> media
> >> related processors in the future.
> >>
> >> Regards,
> >> Joe
> >>
> >> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
> >> [email protected]
> >> > wrote:
> >>
> >> > Hi Joe,
> >> >
> >> > Thanks for all the details.
> >> >
> >> > I wanted to propose that I do some of this work so as to go through
> the
> >> > full cycle of developing a processor and committing it.
> >> >
> >> > Once your changes are merged, I could extend your
> 'ExtractMediaMetadata'
> >> > processor to handle the content, in addition to the metadata.
> >> >
> >> > We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode
> >> with 3
> >> > values: metadataOnly, contentOnly, metadataAndContent.
> >> >
> >> > One thing that looks to be a design issue right now is, your changes
> and
> >> > the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.)
> >> >
> >> > Would it make sense to have a generic processor
> >> > ExtractDocumentMetadataAndContent?  Are there enough specifics in the
> >> > image/video processing stuff to warrant that to be a separate layer;
> >> > perhaps a subclass of ExtractDocumentMetadataAndContent ?  Might it
> make
> >> > sense to rename nifi-media-nar into nifi-text-extract-nar ?
> >> >
> >> > Thanks,
> >> > - Dmitry
> >> >
> >> >
> >> >
> >> > On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
> >> >
> >> > > Dmitry,
> >> > >
> >> > > Yeah, I agree, Tika is pretty impressive.  The original ticket,
> >> > > NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted
> >> > > extraction of metadata from WAV files, but as I got into it I found
> >> > > Tika, so for the same effort it supports the 1,000+ file formats Tika
> >> > > understands.  That new processor is called "ExtractMediaMetadata"; you
> >> > > can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub
> >> > > if you want to give it a try before it's merged.
> >> > >
> >> > > Extracting content for those 1,000+ formats would be a valuable
> >> > > addition.  I see two possible approaches: 1) create a new
> >> > > "ExtractMediaContent" processor that would put the document content in
> >> > > a new flow file, and 2) extend the new "ExtractMediaMetadata" processor
> >> > > so it can extract metadata, content, or both.  One combined processor
> >> > > makes sense if it can provide a performance gain; otherwise two
> >> > > complementary processors may make usage easier.
> >> > >
> >> > > I'm glad to help if you want to take a cut at the processor
> yourself,
> >> or
> >> > I
> >> > > can take a crack at it myself if you'd prefer.
> >> > >
> >> > > Don't hesitate to ask questions or share comments and feedback
> >> regarding
> >> > > the ExtractMediaMetadata processor or the addition of content
> >> handling.
> >> > >
> >> > > Regards,
> >> > > Joe Skora
> >> > >
> >> > > On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> >> > > [email protected]> wrote:
> >> > >
> >> > > > Thanks, Joe!
> >> > > >
> >> > > > Hi Joe S. - I'm definitely up for discussing and contributing.
> >> > > >
> >> > > > While building search-related ingestion systems, I've seen
> metadata
> >> and
> >> > > > text extraction being done all the time; it's always there and
> >> always
> >> > has
> >> > > > to be done for building search indexes.  Beyond that, OCR-related
> >> > > > capabilities are often requested, and the advantage of Tika is
> that
> >> it
> >> > > > supports OCR out of the box.
> >> > > >
> >> > > > - Dmitry
> >> > > >
> >> > > > On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]>
> >> wrote:
> >> > > >
> >> > > > > Dmitry,
> >> > > > >
> >> > > > > Another community member (Joe Skora) has a PR outstanding for
> >> > > > > extracting metadata from media files using Tika.  Perhaps it
> makes
> >> > > > > sense to broaden that to in general extract what Tika can find.
> >> Joe
> >> > -
> >> > > > > perhaps you can discuss your ideas with Dmitry and see if
> >> broadening
> >> > > > > is a good idea or if rather domain specific ones make more
> sense.
> >> > > > >
> >> > > > > This concept of extracting metadata from documents/text files,
> >> etc..
> >> > > > > using something like Tika is certainly useful as that then can
> >> drive
> >> > > > > nice automated routing decisions.
> >> > > > >
> >> > > > > Thanks
> >> > > > > Joe
> >> > > > >
> >> > > > > On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> >> > > > > <[email protected]> wrote:
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > I see that the ExtractText processor extracts text using
> regex.
> >> > > > > >
> >> > > > > > What about a processor that extracts text and metadata from
> >> > incoming
> >> > > > > > files?  That doesn't seem to exist - but perhaps I didn't
> quite
> >> > look
> >> > > in
> >> > > > > the
> >> > > > > > right spots.
> >> > > > > >
> >> > > > > > If that doesn't exist I'd like to implement and commit it,
> using
> >> > > Apache
> >> > > > > > Tika.  There may also be a couple of related processors to
> that.
> >> > > > > >
> >> > > > > > Thoughts?
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > - Dmitry
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
