Thanks, Joe! Hi Joe S. - I'm definitely up for discussing and contributing.
While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.

- Dmitry

On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:

> Dmitry,
>
> Another community member (Joe Skora) has a PR outstanding for
> extracting metadata from media files using Tika. Perhaps it makes
> sense to broaden that to in general extract what Tika can find. Joe -
> perhaps you can discuss your ideas with Dmitry and see if broadening
> is a good idea or if rather domain specific ones make more sense.
>
> This concept of extracting metadata from documents/text files, etc..
> using something like Tika is certainly useful as that then can drive
> nice automated routing decisions.
>
> Thanks
> Joe
>
> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg
> <[email protected]> wrote:
> > Hi,
> >
> > I see that the ExtractText processor extracts text using regex.
> >
> > What about a processor that extracts text and metadata from incoming
> > files? That doesn't seem to exist - but perhaps I didn't quite look in the
> > right spots.
> >
> > If that doesn't exist I'd like to implement and commit it, using Apache
> > Tika. There may also be a couple of related processors to that.
> >
> > Thoughts?
> >
> > Thanks,
> > - Dmitry
