Dmitry,

Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to extract, in general, whatever Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see whether broadening is a good idea or whether domain-specific processors make more sense.

This concept of extracting metadata from documents, text files, etc. using something like Tika is certainly useful, as it can then drive nice automated routing decisions.

Thanks
Joe

On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
> Hi,
>
> I see that the ExtractText processor extracts text using regex.
>
> What about a processor that extracts text and metadata from incoming
> files? That doesn't seem to exist - but perhaps I didn't quite look in the
> right spots.
>
> If that doesn't exist I'd like to implement and commit it, using Apache
> Tika. There may also be a couple of related processors to that.
>
> Thoughts?
>
> Thanks,
> - Dmitry
