Hey Nick,

We've done something like this quite a lot in the OODT project [1]; in fact, in many cases we wrap Tika to do it.
Check out our CmdLineMetExtractor class [2], and this guide [3] on some of our baked-in MetExtractors. I think it would be awesome if we could support a similar interface in Tika (I'd love to push those details upstream of OODT).

Cheers,
Chris

[1] http://oodt.apache.org
[2] http://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/extractors/CmdLineMetExtractor.java
[3] http://oodt.apache.org/components/maven/metadata/user/basic.html

On Apr 5, 2011, at 1:31 PM, Nick Burch wrote:

> Hi All
>
> I'm currently pondering trying to add support for using ffmpeg to provide
> metadata on video (and audio) files. This would be useful for me for the
> file formats which we don't currently support, which is generally the ones
> where there's no handy Java library to call for them.
>
> At the moment, it looks like we do have some command line support, in the
> form of ExternalParser, but that's focused only on the text extraction
> part. It also looks like it might want a few tweaks to make it easier to
> use.
>
> I was therefore thinking of doing some work to improve it, and then adding
> in metadata too (likely via regexps or similar). One thought was to make
> it possible to use ExternalParser in two ways. The first way would be to
> subclass it and provide the mime type, command, and metadata regexps. The
> other would be to provide an xml config file, which'd supply the details.
> Likely with both of these we'd want the parser to check for the external
> command, and claim not to be available if the command isn't there.
>
> Anyone got any thoughts on this sort of thing? Anyone done something like
> it before?
>
> Nick

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
