Joe,

After giving this some more thought, I've started wondering whether all of the cases can be covered by the following filters:
INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type

I believe this covers all the bases. At processor init time, we can analyze the inclusions vs. the exclusions; any overlap would be reported as a configuration error. (I've pasted a rough sketch of what that check might look like at the bottom of this message, below the quoted thread.)

Let me know what you think, thanks.

- Dmitry

On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <[email protected]> wrote:

> Hi Joe,
>
> I follow your reasoning on the semantics of "media". One might argue that media files are a case of "document", or that a document is a case of "media".
>
> I'm not proposing filters for the mode of processing; I'm proposing a flag/enum with 3 values:
>
> A) extract metadata only;
> B) extract content only and place it into the flowfile content;
> C) extract both metadata and content.
>
> I think the default should be C, to extract both. At least in my experience, most flows I've dealt with were interested in extracting both.
>
> I don't see how this mode would benefit from being expression driven - ?
>
> I think we can add this enum mode and have the basic use case covered.
>
> Additionally, further down the line, I was thinking we could ponder the following (these have been essential in search engine ingestion):
>
> 1. Extraction from compressed files/archives. How would UnpackContent work with ExtractMediaAttributes? The use case being: we've got a zip file as input and want to crack it open and unravel it recursively; it may have other, nested zips inside, along with other documents. One way to handle this is to treat the whole archive as one document and merge all attributes into one FlowFile. Another way would be to treat each archive entry as its own flow file and keep a pointer back to the parent archive. Yet another case is when the user might want to extract only the 'leaf' entries and discard any parent container archives.
>
> 2. Attachments and embeddings. Users may want to treat any attached or embedded files as separate flowfiles, perhaps with pointers back to the parent files. This definitely warrants a filter. Oftentimes Office documents have 'media' embeddings which are not of interest, especially for the case of ingesting into a search engine.
>
> 3. PDF. For PDFs, we can do OCR. This is important for 'image'/scanned PDFs for which Tika won't extract text.
>
> I'd like to understand how much of this is already supported in NiFi, and if not, I'd volunteer/collaborate to implement some of this.
>
> - Dmitry
>
> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
>
>> Dmitry,
>>
>> Are you proposing separate filters that determine the mode of processing (metadata / content / metadataAndContent)? I was thinking of one set of selection filters and a static mode switch at the processor instance level, to make configuration more obvious, such that one instance of the processor will handle a known set of files regardless of the processing mode.
>>
>> I was thinking it would be useful for the mode switch to support expression language, but I'm not sure about that, since the selection filters will control which files get processed, and it would be harder to configure if the output flow file could vary between source format and extracted text. So, while it might be easy to do, and occasionally useful, I think in normal use I'd never have a varying mode but would more likely have multiple processor instances with some routing or selection going on further upstream.
>>
>> I wrestled with the naming issue too. I went with "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it seemed to represent the broader context better. In reality, media files are documents and documents are media files, but in the end it's all just semantics.
>>
>> I don't think I would change the NAR bundle name, because I think "nifi-media-nar" establishes it as a place to collect this and other media-related processors in the future.
>>
>> Regards,
>> Joe
>>
>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <[email protected]> wrote:
>>
>>> Hi Joe,
>>>
>>> Thanks for all the details.
>>>
>>> I wanted to propose that I do some of this work so as to go through the full cycle of developing a processor and committing it.
>>>
>>> Once your changes are merged, I could extend your 'ExtractMediaMetadata' processor to handle the content, in addition to the metadata.
>>>
>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
>>>
>>> One thing that looks to be a design issue right now is that your changes and the 'nomenclature' seem media-oriented ("nifi-media-nar", etc.).
>>>
>>> Would it make sense to have a generic processor, ExtractDocumentMetadataAndContent? Are there enough specifics in the image/video processing to warrant a separate layer, perhaps a subclass of ExtractDocumentMetadataAndContent? Might it make sense to rename nifi-media-nar to nifi-text-extract-nar?
>>>
>>> Thanks,
>>> - Dmitry
>>>
>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
>>>
>>>> Dmitry,
>>>>
>>>> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of metadata from WAV files, but as I got into it I found Tika, so for the same effort it supports the 1,000+ file formats Tika understands. The new processor is called "ExtractMediaMetadata"; you can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub if you want to give it a try before it's merged.
>>>>
>>>> Extracting content for those 1,000+ formats would be a valuable addition.
>>>> I see two possible approaches: 1) create a new "ExtractMediaContent" processor that would put the document content in a new flow file, or 2) extend the new "ExtractMediaMetadata" processor so it can extract metadata, content, or both. One combined processor makes sense if it can provide a performance gain; otherwise, two complementary processors may make usage easier.
>>>>
>>>> I'm glad to help if you want to take a cut at the processor yourself, or I can take a crack at it myself if you'd prefer.
>>>>
>>>> Don't hesitate to ask questions or share comments and feedback regarding the ExtractMediaMetadata processor or the addition of content handling.
>>>>
>>>> Regards,
>>>> Joe Skora
>>>>
>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>
>>>>> Thanks, Joe!
>>>>>
>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
>>>>>
>>>>> While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.
>>>>>
>>>>> - Dmitry
>>>>>
>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
>>>>>
>>>>>> Dmitry,
>>>>>>
>>>>>> Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to extract, in general, whatever Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see whether broadening is a good idea or whether domain-specific processors make more sense.
>>>>>>
>>>>>> This concept of extracting metadata from documents/text files, etc., using something like Tika is certainly useful, as it can then drive nice automated routing decisions.
>>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I see that the ExtractText processor extracts text using regex.
>>>>>>>
>>>>>>> What about a processor that extracts text and metadata from incoming files? That doesn't seem to exist - but perhaps I didn't quite look in the right spots.
>>>>>>>
>>>>>>> If that doesn't exist, I'd like to implement and commit it, using Apache Tika. There may also be a couple of related processors to go with it.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - Dmitry
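
P.S. To make the init-time overlap check I mentioned at the top a bit more concrete, here is a minimal sketch of how it might look in a processor's customValidate(). Everything in it is an assumption on my part for illustration: the property names, the comma-separated pattern format, the class name, and the splitPatterns() helper are hypothetical and not part of the existing ExtractMediaMetadata PR; only the content/filename pair is shown, and the metadata and MIME-type pairs would follow the same shape.

// Minimal sketch, assumptions only: property names, the comma-separated
// pattern format, and splitPatterns() are hypothetical, not the actual
// ExtractMediaMetadata API. Only the content/filename pair is shown.
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.components.ValidationContext;
import org.apache.nifi.components.ValidationResult;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.util.StandardValidators;

public abstract class ExtractMediaSketch extends AbstractProcessor {

    // Hypothetical properties: comma-separated lists of file name regex patterns.
    static final PropertyDescriptor INCLUDE_CONTENT_FILENAME_FILTER = new PropertyDescriptor.Builder()
            .name("INCLUDE_CONTENT_FILENAME_FILTER")
            .description("Patterns for which input files get their content extracted, by file name.")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    static final PropertyDescriptor EXCLUDE_CONTENT_FILENAME_FILTER = new PropertyDescriptor.Builder()
            .name("EXCLUDE_CONTENT_FILENAME_FILTER")
            .description("Patterns for which input files do NOT get their content extracted, by file name.")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    @Override
    protected Collection<ValidationResult> customValidate(final ValidationContext context) {
        final Collection<ValidationResult> results = new ArrayList<>();

        // Conservative overlap check: the same literal pattern configured on both
        // the include and the exclude side is reported as a configuration error.
        final Set<String> overlap = splitPatterns(context.getProperty(INCLUDE_CONTENT_FILENAME_FILTER).getValue());
        overlap.retainAll(splitPatterns(context.getProperty(EXCLUDE_CONTENT_FILENAME_FILTER).getValue()));

        if (!overlap.isEmpty()) {
            results.add(new ValidationResult.Builder()
                    .subject("Content filename filters")
                    .valid(false)
                    .explanation("pattern(s) " + overlap + " appear in both the include and the exclude filter")
                    .build());
        }
        return results;
    }

    // Splits a comma-separated property value into trimmed, non-empty patterns.
    private static Set<String> splitPatterns(final String value) {
        final Set<String> patterns = new LinkedHashSet<>();
        if (value != null) {
            for (final String pattern : value.split(",")) {
                if (!pattern.trim().isEmpty()) {
                    patterns.add(pattern.trim());
                }
            }
        }
        return patterns;
    }
}

Note the overlap rule here is deliberately conservative: it only flags a literal pattern that appears on both the include and exclude side. Deciding whether two arbitrary regexes can match the same file name is a much harder problem, so anything smarter than this would probably have to be best-effort.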
