Hi, ok, good feedback, thanks for taking the time to answer.
I feel an urge to take an "RDF vs JSON/..." discussion off-list, as I have seen this discussion since 1999. (btw, RSS originally meant "RDF Site Summary"... so RDF>RSS.) But it's an important discussion, so here is some more input.

RDF is the only cross-format standard out there: there are standardized representations in XML, JSON, HTML, and databases. That would make it a good fit for frameworks such as Tika.

Of course, the 120 minutes it takes to learn RDF are longer than the 10 minutes it takes to learn JSON. My experience was that for data integration projects, the extra 110 minutes pay off. I guess that's the reason why Facebook and Google dig RDF now... it is the only proper way to let data flow from databases out to the web and back into other databases. (That's what Google now supports with price databases and the RDF-based "GoodRelations" e-commerce SEO format.)

If the consensus within Tika is "RDF is too complex for us, we don't need it", that's fine. It took Sebastian Trüg about a year of discussion on the KDE mailing lists to explain why RDF is better suited for data integration in document indexing, until the KDE people were convinced to switch the system search engine to RDF.

Some points:

Inference - please ignore this, you don't need it.

Field definition - you will soon have a problem in Tika when you want to crawl VCARD and ICAL files and extract the full richness of ALL data embedded in those formats. Here RDF helped Aperture a lot. So for the whole area of types and their fields, subfields, and hierarchical fields, RDF could help.

XML - whatever, RDF is serialization-agnostic. It works best in internal APIs I guess, where data should flow from one component to another without being reformatted.

Let's see it the other way round: if you need info on why RDF is better than anything else (ho ho ho), call the Aperture-dev mailing list, people there are eager to help I guess.
Grant Ingersoll used to hang out over at the Aperture-dev mailing list.

If this is ok, I would end this thread now from my side and say: if the question pops up, get in touch with the Aperture or KDE people. If there is a need to get inspired, the Aperture people are there to help. I would guess the same can be said for the writers of the KDE Linux desktop indexing. They also use RDF as the format, and there is an overarching standardization effort (OSCAF.org) amongst all of us... that could also be a place to discuss. We had around a million EUR spent just on discussing those RDF data formats (ontologies) that are now running ;-)

I cc Sebastian Trüg in this mail; he is the main developer and boss-of-ontologies at KDE. I guess that Tika people are welcome to check out what happens on the KDE/Gnome side in the "Xesam" mailing list.

There is (not enough) documentation here on whom to ask in case of questions:
http://sourceforge.net/apps/trac/oscaf/wiki/Communication
http://sourceforge.net/apps/trac/oscaf/wiki/Ontologies

best
Leo

It was Mattmann, Chris A (388J) who said at the right time 14.11.2010 17:48 the following words:
> Thanks Leo, we'll take a look.
>
> FYI, one of the goals of Tika is to be extremely light-weight, and to
> provide canonical metadata representation, independent of any
> particular "view" of metadata, which in my mind RDF is as much of as
> e.g., RSS, or FGDC, JSON, etc., or any one of the myriad views out
> there. Sure it comes with inference, and all of the other promised
> goodies, but in my experience, I've seen little real use of those in
> data management systems. I've seen more use of RDF as a nice, compact
> XML format to represent metadata and allow interchange than anything
> else. I'd be opposed to making it the standard in Tika though, as I
> said b/c to me it's just a view.
>
> Regardless, thanks for reaching out and I have a number of downstream
> ideas for helping Tika become more useful for showing different
> metadata "views" as I call them and plan on starting to
> implement/contribute some of them in the coming year, as soon as this
> book [1] starts to wrap up :) I think a number of other Tika
> community members have been doing a fantastic job at keeping the
> metadata capabilities in Tika simple, light-weight, and feature-rich,
> and I expect it to continue down that path.
>
> Cheers, Chris
>
> [1] http://www.manning.com/mattmann/
>
> On 11/14/10 1:13 AM, "Leo Sauermann" <[email protected]>
> wrote:
>
> Hi Tika, (cc Aperture, just fyi)
>
> I stumbled upon http://wiki.apache.org/tika/MetadataDiscussion and
> http://wiki.apache.org/tika/RecursiveMetadata
>
> The problems don't stop there, if you think it through you end up
> with zip-files containing zip-files containing .pst and email files
> containing attached word documents containing embedded excel.
>
> In the sourceforge project "Aperture" (it's similar to Tika) the
> solution was to use the W3C standard RDF which allows endlessly
> stacking information into each other. This was also used in the
> NEPOMUK-KDE linux implementation, but there in C++ and with a
> slightly different angle to it.
>
> it may be useful to check out their documentation and their status
> of discussion:
>
> the data model: http://www.semanticdesktop.org/ontologies/
>
> this is the specific model of stacking things into each other:
> http://www.semanticdesktop.org/ontologies/2007/01/19/nie/
>
> the stacking/recursive problem was solved using "subcrawlers":
> http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers
>
> general structure of things coming together:
> http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure
>
> From my experience (I am co-author and was initiator of most of the
> above) there is only a limited short-term benefit of adopting this
> thinking, but a bigger long-term benefit, as being compatible with
> RDF/W3C will in the long run make Tika compatible with what happens
> in HTML5 and other standardization efforts. Looking at this stuff
> could help as a guideline for decisions in Tika.
>
> So - Could anyone please think about it for a minute and add these
> links and some ideas how to deal with it to
> http://wiki.apache.org/tika/MetadataDiscussion and
> http://wiki.apache.org/tika/RecursiveMetadata ?
>
> best
> Leo Sauermann, Dr.
> CEO and Founder
>
> p.s. There used to be a much closer tie between Tika and Aperture in
> 2007, but as Aperture development is kind of finished (it's in
> production now at some places and fixes are only done when needed) it
> seems communication between them has lowered a bit. Does anyone know
> why?
>
> mail: [email protected]
> mobile: +43 6991 gnowsis
> http://www.gnowsis.com
>
> helping people remember,
> so join our newsletter
> http://www.gnowsis.com/about/content/newsletter
> ____________________________________________________
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>

--
Leo Sauermann, Dr.
CEO and Founder

mail: [email protected]
mobile: +43 6991 gnowsis
http://www.gnowsis.com

helping people remember,
so join our newsletter
http://www.gnowsis.com/about/content/newsletter
____________________________________________________
