Hi Kevin,

your idea could work for the higher megabyte ranges, I guess; I don't know how it would do with several terabytes.
We have been considering a concept to use Lucene as an RDF backend for a semantic search engine because of its reported excellent scalability, on the order of tens of Megs. The idea was similar to yours, but we thought of using some index extension to introduce the class/property hierarchy (i.e., RDF Schema) and make it searchable via cascaded index lookups. We didn't have the time to test it, though, and would be grateful if you could comment.

Here are the fields. In a draft with three index parts, it's something like:

  node (unique)
  clss (class in schema)
  prop (position-ordered)
  prwt (a scalar value weighting the relation, or 1; position-ordered)
  rsrc (resource, position-ordered)

and for the ontology itself:

  clss
  spcl (superclass, multi-inheritance)

and

  prop (property)
  sprp (super-property, multi-inheritance)
  domn (domain)
  rnge (range)

Best regards,
gregor

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 11, 2003 12:33 AM
To: [EMAIL PROTECTED]
Subject: Lucene as a high-performance RDF database.

I have been giving some thought to using Lucene as an RDF database. I'm specifically thinking about the RDF model and not the RDF syntax.

Essentially this would just comprise triples encoded in a document as fields. So, for example, we would have subject, predicate, and object relationships as document fields. Subjects and predicates would be Tokens, and the object field would be indexed.

For example, a triple (document) would be:

  http://jakarta.apache.org -> title -> "A great Java developer's website"

This would be just one document in the index.

This would have a lot of advantages, most importantly the speed and reliability of Lucene and the ability to run a full-text query on objects. For example, we could query on "Java" and get back "http://jakarta.apache.org".

The major downside I can see is that we would be indexing a LOT of small documents with a LOT of index updates.
Can anyone see any problems here? This database will eventually grow to around 2TB in the next month so performance issues are non-trivial. Most people have deployed Lucene with large document sizes and the fact that most people are citing document COUNT makes me nervous.

Kevin

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
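
To make the triple-per-document idea from the quoted message a bit more concrete, here is a minimal sketch in plain Python. It is not Lucene and does not use the Lucene API; the class TripleIndex, its method names, and the second sample URL are made up for illustration. It only models the layout described above: one small document per triple, subject and predicate kept as untokenized keys, and the object tokenized for full-text lookup.

```python
import re
from collections import defaultdict


class TripleIndex:
    """Toy stand-in for an index holding one 'document' per RDF triple.

    Hypothetical helper for illustration only; not the Lucene API.
    """

    def __init__(self):
        self.docs = []                          # doc id -> (subject, predicate, object)
        self.by_predicate = defaultdict(set)    # exact-match field
        self.by_object_term = defaultdict(set)  # tokenized full-text field

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, subject, predicate, obj):
        """Store one triple as one document and index its fields."""
        doc_id = len(self.docs)
        self.docs.append((subject, predicate, obj))
        self.by_predicate[predicate].add(doc_id)
        for term in self.tokenize(obj):
            self.by_object_term[term].add(doc_id)
        return doc_id

    def search_objects(self, query, predicate=None):
        """Return subjects whose object text contains every query term."""
        terms = self.tokenize(query)
        if not terms:
            return []
        postings = [self.by_object_term.get(t, set()) for t in terms]
        hits = set.intersection(*postings)
        if predicate is not None:
            hits = hits & self.by_predicate.get(predicate, set())
        return sorted({self.docs[d][0] for d in hits})


index = TripleIndex()
index.add("http://jakarta.apache.org", "title",
          "A great Java developer's website")
index.add("http://example.org/garden", "title", "Gardening tips")  # made-up triple

print(index.search_objects("Java"))  # ['http://jakarta.apache.org']
```

Every triple adds one document and several postings, which is exactly the "LOT of small documents" concern raised above; the sketch says nothing about how a real index would behave at 2TB.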

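The "cascaded index lookups" over the class hierarchy mentioned in the reply could be read as a two-stage query: first expand a class to all of its subclasses using the ontology part (clss, spcl), then look up nodes by class in the instance part (node, clss). The following is only a sketch of that reading in plain Python, with dictionaries standing in for the index parts; the class names, the sample nodes, and the function names are assumptions, not part of the draft.

```python
from collections import defaultdict

# Ontology part: clss -> list of direct superclasses (spcl).
# All class names here are hypothetical.
ontology = {
    "Document": [],
    "Resource": [],
    "Website": ["Document"],
    "DeveloperSite": ["Website", "Resource"],  # multi-inheritance
}

# Instance part: node -> clss. Sample nodes are made up for illustration.
instances = {
    "http://jakarta.apache.org": "DeveloperSite",
    "http://example.org/paper": "Document",
    "http://example.org/thing": "Resource",
}


def subclasses_of(target, ontology):
    """Stage 1: return target plus all of its transitive subclasses."""
    children = defaultdict(set)
    for clss, supers in ontology.items():
        for spcl in supers:
            children[spcl].add(clss)
    found, stack = {target}, [target]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def nodes_of_class(target, ontology, instances):
    """Stage 2: return nodes whose class is target or one of its subclasses."""
    wanted = subclasses_of(target, ontology)
    return sorted(node for node, clss in instances.items() if clss in wanted)


print(nodes_of_class("Document", ontology, instances))
# ['http://example.org/paper', 'http://jakarta.apache.org']
```

The same expansion would apply to prop/sprp for super-properties. Whether the expansion is done at query time, as here, or precomputed at index time is a design choice the draft leaves open.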