Hi Tika,
(cc Aperture, just FYI)

I stumbled upon http://wiki.apache.org/tika/MetadataDiscussion and http://wiki.apache.org/tika/RecursiveMetadata

The problems don't stop there. If you think it through, you end up with zip files containing zip files containing .pst and email files containing attached Word documents containing embedded Excel sheets. In the SourceForge project "Aperture" (it's similar to Tika), the solution was to use the W3C standard RDF, which allows nesting information to arbitrary depth. This was also used in the NEPOMUK-KDE Linux implementation, though there in C++ and from a slightly different angle.

It may be useful to check out their documentation and the state of their discussion:

the data model:
http://www.semanticdesktop.org/ontologies/

the specific model for stacking things into each other:
http://www.semanticdesktop.org/ontologies/2007/01/19/nie/

the stacking/recursion problem was solved using "subcrawlers":
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

the general structure of how things come together:
http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure

From my experience (I am co-author and was the initiator of most of the above), adopting this thinking brings only a limited short-term benefit but a bigger long-term one: being compatible with RDF/W3C will, in the long run, make Tika compatible with what happens in HTML5 and other standardization efforts. Looking at this material could serve as a guideline for decisions in Tika.

So: could anyone please think about it for a minute and add these links, plus some ideas on how to deal with this, to http://wiki.apache.org/tika/MetadataDiscussion and http://wiki.apache.org/tika/RecursiveMetadata ?

best
Leo Sauermann, Dr.
CEO and Founder

p.s. There used to be a much closer tie between Tika and Aperture in 2007, but as Aperture development is more or less finished (it is in production at some places now, and fixes are only made when needed), communication between the two seems to have dropped off a bit. Does anyone know why?
mail: [email protected]
mobile: +43 6991 gnowsis
http://www.gnowsis.com
helping people remember, so join our newsletter
http://www.gnowsis.com/about/content/newsletter
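
To make the stacking idea from the message concrete, here is a minimal sketch in plain Java of how nested containers could be flattened into RDF-style statements. It is only an illustration under stated assumptions: the classes Item and Triple, the method describe, and the child URI scheme (parent URI plus "/" plus name) are hypothetical and are not Aperture or Tika API. The property name nie#isPartOf is taken from the NIE ontology linked above, and its exact spelling should be checked against that document.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedMetadataSketch {
    // Assumed property URI from the NIE ontology; verify against the published ontology.
    static final String NIE_IS_PART_OF =
        "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#isPartOf";

    /** Hypothetical stand-in for a container (zip, pst, email) or a leaf document. */
    static final class Item {
        final String name;
        final List<Item> children = new ArrayList<>();

        Item(String name, Item... kids) {
            this.name = name;
            for (Item kid : kids) {
                children.add(kid);
            }
        }
    }

    /** One RDF-style statement: subject, predicate, object (all as URI strings). */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "<" + subject + "> <" + predicate + "> <" + object + "> .";
        }
    }

    /** Walks the tree under root and emits one "is part of" statement per nested item. */
    static List<Triple> describe(Item root, String rootUri) {
        List<Triple> out = new ArrayList<>();
        walk(root, rootUri, out);
        return out;
    }

    private static void walk(Item item, String uri, List<Triple> out) {
        for (Item child : item.children) {
            // Hypothetical URI scheme: append the child name to the parent URI.
            String childUri = uri + "/" + child.name;
            out.add(new Triple(childUri, NIE_IS_PART_OF, uri));
            // Recursion handles arbitrary nesting depth, similar in spirit to a subcrawler.
            walk(child, childUri, out);
        }
    }

    public static void main(String[] args) {
        Item root = new Item("archive.zip",
            new Item("inner.zip",
                new Item("mail.pst",
                    new Item("message.eml",
                        new Item("report.doc",
                            new Item("sheet.xls"))))));
        for (Triple t : describe(root, "file:///archive.zip")) {
            System.out.println(t);
        }
    }
}
```

Because every nested item gets its own URI and a single link to its parent, the depth of nesting never changes the shape of the data. A flat key/value metadata map cannot offer that without inventing its own prefixing scheme.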
