Thanks Leo, we'll take a look. FYI, one of the goals of Tika is to be extremely light-weight and to provide a canonical metadata representation, independent of any particular "view" of metadata. In my mind RDF is as much a view as RSS, FGDC, JSON, or any one of the myriad others out there. Sure, it comes with inference and all of the other promised goodies, but in my experience I've seen little real use of those in data management systems. I've seen more use of RDF as a nice, compact XML format to represent metadata and allow interchange than anything else. I'd be opposed to making it the standard in Tika though because, as I said, to me it's just a view.

Regardless, thanks for reaching out. I have a number of downstream ideas for helping Tika become more useful for showing different metadata "views", as I call them, and I plan on starting to implement/contribute some of them in the coming year, as soon as this book [1] starts to wrap up :) I think a number of other Tika community members have been doing a fantastic job at keeping the metadata capabilities in Tika simple, light-weight, and feature-rich, and I expect it to continue down that path.

Cheers,
Chris

[1] http://www.manning.com/mattmann/

On 11/14/10 1:13 AM, "Leo Sauermann" <[email protected]> wrote:

Hi Tika,
(cc Aperture, just fyi)

I stumbled upon
http://wiki.apache.org/tika/MetadataDiscussion
and
http://wiki.apache.org/tika/RecursiveMetadata

The problems don't stop there: if you think it through, you end up with zip files containing zip files containing .pst and email files containing attached Word documents containing embedded Excel.

In the SourceForge project "Aperture" (it's similar to Tika) the solution was to use the W3C standard RDF, which allows endlessly stacking information into each other. This was also used in the NEPOMUK-KDE Linux implementation, but there in C++ and with a slightly different angle to it.

It may be useful to check out their documentation and the status of their discussion:

the data model:
http://www.semanticdesktop.org/ontologies/

this is the specific model of stacking things into each other:
http://www.semanticdesktop.org/ontologies/2007/01/19/nie/

the stacking/recursive problem was solved using "subcrawlers":
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

general structure of things coming together:
http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure

From my experience (I am co-author and was initiator of most of the above) there is only a limited short-term benefit of adopting this thinking, but a bigger long-term benefit, as being compatible with RDF/W3C will in the long run make Tika compatible with what happens in HTML5 and other standardization efforts.

Looking at this stuff could help as a guideline for decisions in Tika.

So, could anyone please think about it for a minute and add these links and some ideas on how to deal with it to
http://wiki.apache.org/tika/MetadataDiscussion
and
http://wiki.apache.org/tika/RecursiveMetadata
?

best
Leo Sauermann, Dr.
CEO and Founder

p.s. There used to be a much closer tie between Tika and Aperture in 2007, but as Aperture development is kind of finished (it's in production now at some places and fixes are only done when needed), it seems communication between them has dropped off a bit. Does anyone know why?

mail: [email protected]
mobile: +43 6991 gnowsis
http://www.gnowsis.com
helping people remember, so join our newsletter
http://www.gnowsis.com/about/content/newsletter
____________________________________________________

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
