Hi Joe,
Thanks for all the details.
I'd like to propose that I take on some of this work, so that I go through
the full cycle of developing a processor and committing it.
Once your changes are merged, I could extend your 'ExtractMediaMetadata'
processor to handle the content, in addition to the metadata.
We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3
values: metadataOnly, contentOnly, metadataAndContent.
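To make the mode idea concrete, here is a rough sketch of how the three values might be modeled. The enum name ExtractionMode and its helper methods are my own placeholders, not anything in the current PR, and how it would be wired into a NiFi property is left open:

```java
/**
 * Hypothetical extraction mode for the proposed processor property.
 * The names here are illustrative assumptions, not part of PR-252.
 */
enum ExtractionMode {
    METADATA_ONLY("metadataOnly", true, false),
    CONTENT_ONLY("contentOnly", false, true),
    METADATA_AND_CONTENT("metadataAndContent", true, true);

    private final String value;      // value as it would appear in the processor config
    private final boolean metadata;  // whether metadata should be extracted
    private final boolean content;   // whether document content should be extracted

    ExtractionMode(String value, boolean metadata, boolean content) {
        this.value = value;
        this.metadata = metadata;
        this.content = content;
    }

    String value() {
        return value;
    }

    boolean extractsMetadata() {
        return metadata;
    }

    boolean extractsContent() {
        return content;
    }

    /** Resolves a configured value (case-insensitive) to a mode, or throws if unknown. */
    static ExtractionMode fromValue(String configured) {
        for (ExtractionMode mode : values()) {
            if (mode.value.equalsIgnoreCase(configured)) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown extraction mode: " + configured);
    }
}
```

The processor could then branch on extractsMetadata() and extractsContent() rather than comparing strings, which would keep the two filters independent of the mode.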
One thing that looks like a design issue right now is that your changes and
the naming seem media-oriented ("nifi-media-nar" etc.).
Would it make sense to have a generic processor,
ExtractDocumentMetadataAndContent? Are there enough specifics in the
image/video processing to warrant a separate layer, perhaps a subclass of
ExtractDocumentMetadataAndContent? Might it make sense to rename
nifi-media-nar to nifi-text-extract-nar?
Thanks,
- Dmitry
On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
> Dmitry,
>
> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615
> <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of
> metadata from WAV files, but as I got into it I found Tika, so for the same
> effort the processor supports the 1,000+ file formats Tika understands. The
> new processor is called "ExtractMediaMetadata"; you can pull PR-252
> <https://github.com/apache/nifi/pull/252> from GitHub if you want to give
> it a try before it's merged.
>
> Extracting content for those 1,000+ formats would be a valuable addition.
> I see two possible approaches, 1) create a new "ExtractMediaContent"
> processor that would put the document content in a new flow file, and 2)
> extend the new "ExtractMediaMetadata" processor so it can extract metadata,
> content, or both. One combined processor makes sense if it can provide a
> performance gain, otherwise two complementary processors may make usage
> easier.
>
> I'm glad to help if you want to take a cut at the processor yourself, or I
> can take a crack at it myself if you'd prefer.
>
> Don't hesitate to ask questions or share comments and feedback regarding
> the ExtractMediaMetadata processor or the addition of content handling.
>
> Regards,
> Joe Skora
>
> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> [email protected]> wrote:
>
> > Thanks, Joe!
> >
> > Hi Joe S. - I'm definitely up for discussing and contributing.
> >
> > While building search-related ingestion systems, I've seen metadata and
> > text extraction being done all the time; it's always there and always has
> > to be done for building search indexes. Beyond that, OCR-related
> > capabilities are often requested, and the advantage of Tika is that it
> > supports OCR out of the box.
> >
> > - Dmitry
> >
> > On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
> >
> > > Dmitry,
> > >
> > > Another community member (Joe Skora) has a PR outstanding for
> > > extracting metadata from media files using Tika. Perhaps it makes
> > > sense to broaden that to extract, in general, whatever Tika can find. Joe -
> > > perhaps you can discuss your ideas with Dmitry and see if broadening
> > > is a good idea or if domain-specific processors make more sense.
> > >
> > > This concept of extracting metadata from documents, text files, etc.
> > > using something like Tika is certainly useful as that then can drive
> > > nice automated routing decisions.
> > >
> > > Thanks
> > > Joe
> > >
> > > On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> > > <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I see that the ExtractText processor extracts text using regex.
> > > >
> > > > What about a processor that extracts text and metadata from incoming
> > > > files? That doesn't seem to exist - but perhaps I didn't quite look in
> > > > the right spots.
> > > >
> > > > If that doesn't exist I'd like to implement and commit it, using Apache
> > > > Tika. There may also be a couple of related processors to that.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,
> > > > - Dmitry
> > >
> >
>