Hi,

From: Leo Sauermann [mailto:[email protected]]
> RDF is the only cross-format standard out there, there are standardized
> representations in XML, JSON, HTML, and databases. That would make it a
> good fit for frameworks, such as Tika.

Agreed. The idea of using XMP (a metadata model based on RDF) has come
up every now and then on d...@tika (see the archives), and I think
that's what we should be working towards.

Note however that the scope of Tika has, at least so far, been
intentionally smaller than that of Aperture. For example, we explicitly
don't try to preserve the full structural or semantic details of parsed
documents. Thus the points about mapping VCARD or ICAL data to RDF are
somewhat irrelevant for Tika, as we'd just map such data to
semi-structured XHTML whose main purpose is to support full text
indexing or other unstructured text processing applications. In other
words, Tika is lossy by design.

Another point, more related to recursive metadata, is that we make no
attempt at defining a representation for compound documents. The
rationale for this is that such representations are necessarily
application- or domain-specific. Tika avoids making those design choices
by having the Parser API recognize only singular documents, while
allowing programmatic access to subdocuments through the
EmbeddedDocumentExtractor (or the more general ParseContext) mechanism.
A client application can use these tools to construct any kind of
hierarchical metadata structure.

To summarize: yes, I think RDF is a good idea for Tika, but only in
terms of extending our metadata model to XMP. I don't see how RDF would
be more useful than XHTML in representing the full text content of a
document, at least as long as we're not looking at radically extending
the scope of Tika.

BR,

Jukka Zitting
