While I'm still tweaking and fixing small bugs, the text indexing stuff that I've been working on is ready for rolling into the trunk. Here's a quick overview of what's going on:

In order to facilitate text indexing, I added some simple event functionality that allows handlers (callables) to subscribe to events that happen to graphs. When a triple is added to a graph, an event is fired and callables that have subscribed to that event are handed the event and the triple that triggered it. As complex as it sounds, it's only a few lines of Python code. A new kind of graph called a TextIndex subscribes to graph events and indexes any triples that are added to the data graph. So there are two graphs when text indexing, the data graph which contains the source of all the triples, and the index graph that contains the index information of the data graph. The core of events is here:

http://svn.rdflib.net/branches/michel-events/rdflib/events.py

Graph support is done by a subclass:

http://svn.rdflib.net/branches/michel-events/rdflib/EventGraph.py

although that should probably be rolled into Graph.

Events remove the need to subclass and override Graph.add/remove to add special behavior. It keeps complex stuff out of Graph. In fact I imagine some complex stuff that is now in Graph could be refactored into event handlers. Of note I discovered is that it is absurdly difficult to subclass Graph these days, you have to subclass Graph, and ConjunctiveGraph, and override CG.parse at a bare minimum to enhance Graph. Event handlers remove a lot of the need to subclass as I've found overriding add/remove to be the most common case.

The index graph will index any triple that has a object literal string. I won't bore you with the mundane triple calculus, but the basic operation is that literal text is "split" into terms. The terms are filtered to remove very common terms (currently english only) and each term is stored in an RDF graph along with other statements that says which statements the term occured in in the *data* graph. Here's an example using the boston.openguides.org data (about 80K triples data graph size, 86K triples index graph size):

>>> for s,p,o in t.search('coffee'): print s
...
http://boston.openguides.org/?id=Sunrise_Coffee_%26_Sub_Shop_Ii;format=rdf#obj
http://boston.openguides.org/?id=Two_Sisters_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Winston's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Sonny's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Topsfield_Bagel_Co_%26_Coffee;format=rdf#obj
http://boston.openguides.org/?id=Romano's_Bakery_%26_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Co;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Company;format=rdf#obj

Here I'm just printing the subject, but you can print the predicate that the term occurs in as well (Note the term 'coffee' occurs in the *objects* of these statements which for brevity here I am only showing the *subjects*. In this data, the object is the name of the company which the subject URI also happens to be based on, but the subjects are not indexed, only the objects). The object is always None, since the text index does not store the object literal itself (that would make the index huge) and that data is easily available from the source graph (if you keep it around). For convienience, text indexes support a 'link_to' method that lets you link a TI to a data graph so that when you query it it returns the real object literal value instead of None. A straightforward doctest and all the gory details are in:

http://svn.rdflib.net/branches/michel-events/rdflib/TextIndex.py

In addition to facilitating text indexing, this events adds some other bonus functionality, like multiple stores can subscribe to one graph, and when the one graph changes, all the stores changes as well. Note that the graph is still only backed by one graph that is actually queried and considered the backend for the graph, but other stores can "follow along" the changes made to the primary store. For example, a graph can have an in memory backend, and also have a database store subscribe to the graph, so that any changes to the in memory graph are written through to the db as well. I have also used a subscriber that keeps a count of all the triples added to the graph, and commits a ZODB subtransaction or prints a progress message when a threshold is hit. This could solve the "upkeep" use case Chimzie mentioned earlier.

Comments? I'll probably be rolling this in this week if there is no conflicting work going on. This work was paid for by the good folks at Six Feet Up and we are starting to use this code in production. Searching time is very fast for even very large data sets, and I was able to index 2MT from a 10MT Swoogle dump and still get subsecond search speeds for searches. I think this could really differentiate rdflib, especially given that we are not text indexing using xapian or any kind of black box, but instead keep all the index data in rdf itself. This makes it ultimately portable.
_______________________________________________
Dev mailing list
[email protected]
http://rdflib.net/mailman/listinfo/dev

Reply via email to