Hi Stefano, others,

You may want to have a look at the following discussion on the openrdf
forum: http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423
Among other things, it discusses some work done on top of Hadoop/HBase that
applies the map/reduce mechanism not only to searching key-value pairs, but
also to evaluating relational operators! I think this work is an ideal
platform to accomplish what you write about. Writing a Sail layer to
communicate with such a cluster looks very doable.

--
Arjohn Kampman, Senior Software Engineer
Aduna - Guided Exploration
www.aduna-software.com

Stefano Mazzocchi wrote:
> [apologies for the cross-posting]
>
> There is a trend emerging in the IT space about system architectures: if
> your software system is not designed with N>1 in mind from the start
> (where N is the number of coordinated instances of the software running
> on shared-nothing machines), it's going to be a problem later.
>
> Google, Yahoo and Amazon are famous for their N>1 architectural philosophy.
>
> Normal web-site infrastructures are heavily multi-tiered and
> horizontally scalable in several of these tiers, but the data-management
> layer is notoriously N=1, at least in principle, and it's pretty much a
> given that today's growing pains in web-infrastructure scalability are
> around the data-management layer (which is normally an RDBMS).
>
> We (SIMILE) are currently developing a system that tries really hard to
> be N>1-friendly but uses Sesame HTTP Sails as the persistent
> data-management layer, and memcached as a way to avoid querying the
> triple store unless absolutely necessary. My queries are normally very
> simple, stuff like "s ?p ?o" or "s p ?o" or "?s p o", which, in fact,
> don't need a triple store at all; a simple key-value store (such as
> Amazon Dynamo or CouchDB) would do just fine.
>
> But there are times (rare, but important) when a slightly more complex
> query might be required... in order to avoid making hundreds of
> key-value calls.
> Right now, under development and with practically zero load, the HTTP
> Sail performance is completely reasonable (especially since memcached or
> local stores handle most of the load anyway), but I'm growing more and
> more concerned about relying on a data-management tier that is, in fact,
> designed with an N=1 architecture in mind.
>
> Sure, I could use a MySQL store instead of a native store, in case the
> native store performance turns out to be suboptimal... or we could write
> a BerkeleyDB-based native store to squeeze out some more performance...
> or we could add more processors to the machine... but if your web site
> starts to grow, it grows quadratically, and there is no way you can
> scale hardware on a single machine quadratically (without having your
> own infrastructural costs grow even more than that!).
>
> According to their Dynamo paper [1], Amazon's requirement for 'quality
> of service' of the persistent data-management layer is "99.9% of the
> requests have to be answered in less than 300ms with a load of 500
> requests/sec".
>
> Obviously, we don't have such a high requirement for quality of service,
> but I would very much like to have "95% of the requests answered in less
> than 300ms with a load of 30 req/sec", which is a *lot* more feasible in
> real life but still incredibly problematic with an N=1 architectural
> vision (especially for fault-tolerance).
>
> So, the trend is to move the higher-level data management and semantics
> completely to the application level and to rely on fast, massively
> scalable, completely decentralized and self-managing key-value 'clouds'
> that expose a super-simple "get/set/delete" hashtable-like API, even as
> a web service.
> My bet is that we'll see a lot of such "Dynamo"-like systems emerging in
> the future, more or less easy to maintain and manage, more or less
> reliable, more or less written in widely known languages (Dynamo is
> written in Java) or powerful but largely unknown ones (CouchDB is
> written in Erlang).
>
> The question then becomes: what about triple stores?
>
> One of the reasons why triple stores are appealing as a data-management
> tier in a web application is that they favor data-centric incremental
> development: it's practically impossible to know ahead of time what your
> relational data model is going to look like once your web site goes into
> production when you start prototyping it. Data-first data-management
> approaches (triple stores, key-value stores, OODBMSs) are much more
> natural in following the evolution of a prototype than structure-first
> data-management approaches (current-generation SQL-based RDBMSs).
>
> But unlike OODBMSs (then) and triple stores (now), key-value stores are
> the only ones that focus on delivering performance more than on
> delivering RDBMS-like functionality.
>
> Let's face it: RDF is nothing but the good old entity-relationship model
> (which is the basis of any relational database) with URIs sprinkled on
> top. And if it were possible to 'scale' an implementation of the
> entity-relationship model without requiring you to 'compile' its
> structure/schema into the database, it would already exist.
>
> Instead, the trend is to go with key-value pairs and map/reduce jobs
> that 'precompile' the queries, keeping N>1 firmly in mind.
>
> [Column stores could be seen as a bunch of clever disk-I/O-influenced
> optimizations of the above... but they still feel N=1 deep in their
> souls, which concerns me.]
>
> - o -
>
> But here is what I think: key-value stores are great, simple to use and
> refreshingly scalable, should the need emerge: if there were an open
> source Dynamo today, I would probably use it.
>
> But there isn't...
> And I wonder: how hard would it be to adapt the Sesame implementation to
> start considering itself an N>1 application? Could we obtain the "95% of
> requests under 300ms with 30 req/sec" QoS with "s p ?o" queries?
>
> I don't care if SPARQL is slow; I'll cache the results or throw new
> silicon at it... but how hard would it be to make Sesame feel as
> refreshing and comforting as Dynamo does (scalability- and QoS-wise), at
> least for the very basic data-management functionalities?
>
> Because it would be a huge win: for simple queries, it performs just
> like a key-value store; for more complicated queries, it's slower but
> still scalable.
>
> Yes, I perfectly understand that distribution means graph clustering,
> minimum cuts, distributed transactions and all that tarpit that sank
> most of the advancements in RDBMS technology over the last 20 years...
> but what if SPARQL queries are "injected" ahead of time and computed
> with map/reduce jobs? And if not, such 'exploratory' queries will be
> slow; who cares, so be it.
>
> What do you think?

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
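The simple triple patterns Stefano lists ("s ?p ?o", "s p ?o", "?s p o") can
indeed be served straight from a key-value store. A minimal sketch of the
idea, assuming an in-memory dict-based store with one index per pattern shape
(all names here are made up for illustration; this is not Sesame's Sail API):

```python
# Hypothetical sketch: serving the three simple triple patterns from
# plain hash indexes, so each lookup is a single key-value "get".

class TripleKV:
    """Three indexes so each simple pattern is one key lookup."""

    def __init__(self):
        self.by_s = {}    # s      -> set of (p, o)  for "s ?p ?o"
        self.by_sp = {}   # (s, p) -> set of o       for "s p ?o"
        self.by_po = {}   # (p, o) -> set of s       for "?s p o"

    def add(self, s, p, o):
        self.by_s.setdefault(s, set()).add((p, o))
        self.by_sp.setdefault((s, p), set()).add(o)
        self.by_po.setdefault((p, o), set()).add(s)

    def query(self, s=None, p=None, o=None):
        if s is not None and p is None and o is None:
            return self.by_s.get(s, set())          # "s ?p ?o"
        if s is not None and p is not None and o is None:
            return self.by_sp.get((s, p), set())    # "s p ?o"
        if s is None and p is not None and o is not None:
            return self.by_po.get((p, o), set())    # "?s p o"
        raise NotImplementedError("complex patterns need a real store")


# Fabricated example data
store = TripleKV()
store.add("urn:item1", "dc:title", "Dynamo paper")
store.add("urn:item1", "dc:creator", "Amazon")
store.add("urn:item2", "dc:creator", "Amazon")

print(store.query(s="urn:item1", p="dc:title"))           # {'Dynamo paper'}
print(sorted(store.query(p="dc:creator", o="Amazon")))    # ['urn:item1', 'urn:item2']
```

In a distributed setting each of the three indexes would map naturally onto a
Dynamo- or memcached-style keyspace, which is the point of the argument: the
common case never needs query evaluation at all.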
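The "inject SPARQL queries ahead of time and compute them with map/reduce
jobs" idea can be sketched in miniature. Below, a two-pattern join
(?s foaf:knows ?x . ?x foaf:name ?n) is precomputed with a toy map phase
(emit pairs keyed on the shared variable ?x), a shuffle, and a reduce phase
(cross-product the two sides per key). This is just the shape of the
computation, not Hadoop, and the data is fabricated:

```python
# Toy map/reduce sketch: precomputing the join of two triple patterns
# so the answers are ready before any live query arrives.
from collections import defaultdict

triples = [
    ("urn:alice", "foaf:knows", "urn:bob"),
    ("urn:alice", "foaf:knows", "urn:carol"),
    ("urn:bob",   "foaf:name",  "Bob"),
    ("urn:carol", "foaf:name",  "Carol"),
]

def map_phase(triple):
    """Emit (join-key, tagged value) pairs keyed on the shared variable ?x."""
    s, p, o = triple
    if p == "foaf:knows":
        yield (o, ("knows", s))   # ?x is the object of foaf:knows
    elif p == "foaf:name":
        yield (s, ("name", o))    # ?x is the subject of foaf:name

def reduce_phase(key, values):
    """Cross-product the two pattern sides that met on the same join key."""
    knowers = [v for tag, v in values if tag == "knows"]
    names = [v for tag, v in values if tag == "name"]
    for s in knowers:
        for n in names:
            yield (s, key, n)

# Shuffle: group mapper output by join key, then reduce each group.
groups = defaultdict(list)
for t in triples:
    for k, v in map_phase(t):
        groups[k].append(v)

results = sorted(r for k, vs in groups.items() for r in reduce_phase(k, vs))
print(results)
# [('urn:alice', 'urn:bob', 'Bob'), ('urn:alice', 'urn:carol', 'Carol')]
```

Both phases are embarrassingly parallel across the N>1 nodes holding the
triples, which is exactly why the slow 'exploratory' queries can be tolerated:
the known-in-advance ones are batch-precompiled.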
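A QoS target like Stefano's "95% of requests answered in less than 300ms" is
a percentile check on observed latencies. A quick sketch using the
nearest-rank method, with a made-up latency sample:

```python
# Sketch: checking a "95% of requests under 300 ms" QoS target against a
# sample of observed latencies (nearest-rank percentile; sample data is
# fabricated for illustration).
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 20 fabricated request latencies in milliseconds
samples = [120, 95, 210, 180, 260, 140, 330, 90, 110, 250,
           170, 130, 280, 105, 160, 290, 115, 200, 145, 175]

p95 = percentile(samples, 95)
print(p95)          # 290 for this sample
print(p95 < 300)    # True: this sample meets the target
```

Note the one outlier (330 ms) does not break the target, which is what makes
a 95th-percentile goal so much more forgiving than Amazon's 99.9% one.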
