Re: Text and metadata extraction processor

Joe Skora Tue, 29 Mar 2016 11:38:13 -0700

Dmitry,

Yeah, I agree, Tika is pretty impressive.  The original ticket, NIFI-615
<https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of
metadata from WAV files, but as I got into it I found Tika, so for the same
effort it supports the 1,000+ file formats Tika understands.  That new
processor is called "ExtractMediaMetadata"; you can pull PR-252
<https://github.com/apache/nifi/pull/252> from GitHub if you want to give
it a try before it's merged.

Extracting content for those 1,000+ formats would be a valuable addition.
I see two possible approaches: 1) create a new "ExtractMediaContent"
processor that would put the document content in a new flow file, or 2)
extend the new "ExtractMediaMetadata" processor so it can extract metadata,
content, or both.  One combined processor makes sense if it can provide a
performance gain; otherwise, two complementary processors may make usage
easier.
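
To make the combined option concrete, here is a rough sketch of a single
extractor with a mode switch, so that one parse of the input can serve
metadata, content, or both.  It is only an illustration under loud
assumptions: it is not NiFi or Tika code, Python's standard-library `wave`
module stands in for Tika (WAV only), and the `extract` function, the `mode`
values, and the `wav.*` attribute names are all hypothetical.

```python
import io
import wave


def make_wav(seconds=1, rate=8000):
    """Build a small silent mono 16-bit WAV in memory for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate * seconds)
    return buf.getvalue()


def extract(data, mode="both"):
    """Open the file once and return (attributes, content).

    mode is a hypothetical setting: "metadata", "content", or "both".
    attributes is a dict of string values (like flow file attributes);
    content is the raw audio frames, or None when not requested.
    """
    if mode not in ("metadata", "content", "both"):
        raise ValueError("unknown mode: " + mode)
    attributes, content = {}, None
    with wave.open(io.BytesIO(data), "rb") as w:
        if mode in ("metadata", "both"):
            # Hypothetical attribute names, chosen only for this sketch.
            attributes = {
                "wav.channels": str(w.getnchannels()),
                "wav.sample.rate": str(w.getframerate()),
                "wav.frames": str(w.getnframes()),
            }
        if mode in ("content", "both"):
            content = w.readframes(w.getnframes())
    return attributes, content


if __name__ == "__main__":
    attrs, body = extract(make_wav(), mode="both")
    print(attrs, len(body))
```

The two-processor option would amount to calling this twice, once per mode,
which opens and parses the input twice.  Whether avoiding that second parse
matters in practice is exactly the performance question above.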

I'm glad to help if you want to take a cut at the processor yourself, or I
can take a crack at it myself if you'd prefer.

Don't hesitate to ask questions or share comments and feedback regarding
the ExtractMediaMetadata processor or the addition of content handling.

Regards,
Joe Skora

On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
[email protected]> wrote:

> Thanks, Joe!
>
> Hi Joe S. - I'm definitely up for discussing and contributing.
>
> While building search-related ingestion systems, I've seen metadata and
> text extraction being done all the time; it's always there and always has
> to be done for building search indexes.  Beyond that, OCR-related
> capabilities are often requested, and the advantage of Tika is that it
> supports OCR out of the box.
>
> - Dmitry
>
> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
>
> > Dmitry,
> >
> > Another community member (Joe Skora) has a PR outstanding for
> > extracting metadata from media files using Tika.  Perhaps it makes
> > sense to broaden that to extract, in general, whatever Tika can find.
> > Joe - perhaps you can discuss your ideas with Dmitry and see if
> > broadening is a good idea or if domain-specific processors make more
> > sense.
> >
> > This concept of extracting metadata from documents, text files, etc.,
> > using something like Tika is certainly useful as that then can drive
> > nice automated routing decisions.
> >
> > Thanks
> > Joe
> >
> > On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > I see that the ExtractText processor extracts text using regex.
> > >
> > > What about a processor that extracts text and metadata from incoming
> > > files?  That doesn't seem to exist - but perhaps I didn't quite look in
> > > the right spots.
> > >
> > > If that doesn't exist I'd like to implement and commit it, using Apache
> > > Tika.  There may also be a couple of related processors to that.
> > >
> > > Thoughts?
> > >
> > > Thanks,
> > > - Dmitry
> >
>
