Re: Text and metadata extraction processor

Dmitry Goldenberg Thu, 24 Mar 2016 08:41:29 -0700

Thanks, Joe!

Hi Joe S. - I'm definitely up for discussing and contributing.


While building search-related ingestion systems, I've seen metadata and
text extraction come up every time; it is a required step for building
search indexes.  Beyond that, OCR-related capabilities are often
requested, and an advantage of Tika is that it supports OCR out of the
box.

- Dmitry

On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:

> Dmitry,
>
> Another community member (Joe Skora) has a PR outstanding for
> extracting metadata from media files using Tika.  Perhaps it makes
> sense to broaden that to extract, in general, whatever Tika can find.
> Joe - perhaps you can discuss your ideas with Dmitry and see whether
> broadening is a good idea or whether domain-specific ones make more sense.
>
> This concept of extracting metadata from documents, text files, etc.
> using something like Tika is certainly useful, as it can then drive
> nice automated routing decisions.
>
> Thanks
> Joe
>
> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> <[email protected]> wrote:
> > Hi,
> >
> > I see that the ExtractText processor extracts text using regex.
> >
> > What about a processor that extracts text and metadata from incoming
> > files?  That doesn't seem to exist - but perhaps I didn't quite look in
> > the right spots.
> >
> > If that doesn't exist I'd like to implement and commit it, using Apache
> > Tika.  There may also be a couple of related processors to that.
> >
> > Thoughts?
> >
> > Thanks,
> > - Dmitry
>
