Hi,

From: Leo Sauermann [mailto:[email protected]]
> RDF is the only cross-format standard out there, there are standardized
> representations in XML, JSON, HTML, and databases. That would make it a
> good fit for frameworks, such as Tika.

Agreed. The idea of using XMP (a metadata model based on RDF) has come up every 
now and then on d...@tika (see the archives), and I think that's what we should 
be working towards. Note, however, that the scope of Tika has, at least so far, 
been intentionally smaller than that of Aperture.

For example, we explicitly don't try to preserve the full structural or 
semantic details of parsed documents. Thus the points about mapping VCARD or 
ICAL data to RDF are somewhat irrelevant for Tika, as we'd just map such data 
to semi-structured XHTML whose main purpose is to support full text indexing or 
other unstructured text processing applications. In other words, Tika is lossy 
by design.

Another point, more related to recursive metadata, is that we make no attempt 
at defining a representation for compound documents. The rationale for this is 
that such representations are necessarily application- or domain-specific. Tika 
avoids making those design choices by having the Parser API only recognize 
singular documents, but allowing programmatic access to subdocuments through 
the EmbeddedDocumentExtractor (or the more general ParseContext) mechanism. A 
client application can use these tools to construct any kind of hierarchical 
metadata structure.
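
As a rough illustration of that division of labour, here is a minimal sketch 
in plain Java. It deliberately does not use Tika's classes: Doc, 
EmbeddedHandler, Node, parse and buildTree are hypothetical stand-ins invented 
for this example, with EmbeddedHandler playing the role that 
EmbeddedDocumentExtractor plays above. The "parser" only ever sees one document 
and hands each embedded document to a client-supplied callback; the tree shape 
is entirely the client's choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these types are Tika's API.
class CompoundDocumentSketch {

    /** A toy document: a name, some text, and zero or more embedded documents. */
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded;

        Doc(String name, String text, Doc... embedded) {
            this.name = name;
            this.text = text;
            this.embedded = Arrays.asList(embedded);
        }
    }

    /** Client-supplied hook that receives each embedded document (assumed name). */
    interface EmbeddedHandler {
        void handleEmbedded(Doc child);
    }

    /** The "parser": it knows only singular documents and defines no compound format. */
    static String parse(Doc doc, EmbeddedHandler handler) {
        for (Doc child : doc.embedded) {
            handler.handleEmbedded(child); // the client decides what happens here
        }
        return doc.text;
    }

    /** One possible client-side structure; another application could pick a different one. */
    static final class Node {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
    }

    /** The client builds its own hierarchy by recursing from inside the callback. */
    static Node buildTree(Doc doc) {
        Node node = new Node();
        node.metadata.put("name", doc.name);
        String text = parse(doc, child -> node.children.add(buildTree(child)));
        node.metadata.put("length", String.valueOf(text.length()));
        return node;
    }

    /** Renders the tree as one indented name per line. */
    static String render(Node node, String indent) {
        StringBuilder out = new StringBuilder(indent)
                .append(node.metadata.get("name")).append('\n');
        for (Node child : node.children) {
            out.append(render(child, indent + "  "));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Doc mail = new Doc("message.eml", "See attached.",
                new Doc("report.zip", "",
                        new Doc("report.txt", "Quarterly numbers")));
        System.out.print(render(buildTree(mail), ""));
    }
}
```

The point of the sketch is that the nesting lives in buildTree, i.e. in the 
client, so the parsing side never has to commit to a compound-document 
representation.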

To summarize: yes, I think RDF is a good idea for Tika, but only in terms of 
extending our metadata model to XMP. I don't see how RDF would be more useful 
than XHTML in representing the full text content of a document, at least as 
long as we're not looking at radically extending the scope of Tika.

BR,

Jukka Zitting