Hi Stefano, others,

You may want to have a look at the following discussion on the openrdf
forum:
http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423

Among other things, it discusses some work done on top of Hadoop/HBase
that applies the map/reduce mechanism not only to searching key-value
pairs, but also to evaluating relational operators! I think this work is
an ideal platform for accomplishing what you write about. Writing a Sail
layer to communicate with such a cluster looks very doable.

-- 
Arjohn Kampman, Senior Software Engineer
Aduna - Guided Exploration
www.aduna-software.com


Stefano Mazzocchi wrote:
> [apologies for the cross-posting]
> 
> There is a trend emerging in the IT space about system architectures: if
> your software system is not designed with N>1 in mind from the start
> (where N is the number of coordinated instances of the software running
> on shared-nothing machines), it's going to be a problem later.
> 
> Google, Yahoo and Amazon are famous for their N>1 architectural philosophy.
> 
> Normal web-site infrastructures are heavily multi-tiered and
> horizontally scalable in several of these tiers, but the data-management
> layer is notoriously N=1, at least in principle, and it's pretty much a
> given that today's growing pains in web-infrastructure scalability are
> around the data-management layer (which is normally an RDBMS).
> 
> We (SIMILE) are currently developing a system that tries really hard to
> be N>1-friendly but uses Sesame HTTP Sails as the persistent
> data-management layer and memcached as a way to avoid querying the
> triple-store unless absolutely necessary. My queries are normally very
> simple, stuff like "s ?p ?o" or "s p ?o" or "?s p o", which, in fact,
> don't need a triple store at all; a simple key-value store (such as
> Amazon Dynamo or CouchDB) would do just fine.
> 
> But there are times (rare, but important) where a slightly more complex
> query might be required... in order to avoid making hundreds of
> key-value calls.
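The simple patterns above map directly onto hashtable lookups. Here is a minimal Java sketch of that idea; all class and method names are illustrative, not Sesame or Dynamo APIs:

```java
import java.util.*;

// Sketch: resolving "s ?p ?o" and "s p ?o" patterns with nothing but
// hashtable lookups -- no query engine involved. Illustrative only.
public class TripleKV {
    // subject -> (predicate -> objects)
    private final Map<String, Map<String, Set<String>>> bySubject = new HashMap<>();

    public void add(String s, String p, String o) {
        bySubject.computeIfAbsent(s, k -> new HashMap<>())
                 .computeIfAbsent(p, k -> new HashSet<>())
                 .add(o);
    }

    // "s ?p ?o": one key-value get
    public Map<String, Set<String>> match(String s) {
        return bySubject.getOrDefault(s, Collections.emptyMap());
    }

    // "s p ?o": two nested gets
    public Set<String> match(String s, String p) {
        return match(s).getOrDefault(p, Collections.emptySet());
    }
}
```

A "?s p o" pattern would need a second index keyed by predicate and object, but the point stands: none of these lookups require query evaluation.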
> 
> Right now, under development and with practically zero load, the HTTP
> Sail performance is completely reasonable (especially since memcached or
> local stores handle most of the load anyway) but I'm growing more and
> more concerned about relying on a data-management tier that is, in fact,
> designed with a N=1 architecture in mind.
> 
> Sure, I could use a MySQL store instead of a native store, in case the
> native store performance turns out to be suboptimal... or we could write
> a BerkeleyDB-based native store to squeeze out some more performance...
> or we could add more processors to the machine.... but if your web site
> starts to grow, it grows quadratically, and there is no way you can
> scale hardware on a single machine quadratically (without having your
> own infrastructural costs grow even more than that!).
> 
> According to their Dynamo paper [1], Amazon's requirement for 'quality
> of service' of the persistent data management layer is "99.9% of the
> requests have to be answered in less than 300ms with a load of 500
> requests/sec".
> 
> Obviously, we don't have such a high requirement for quality of service,
> but I would very much like to have "95% of the requests have to be
> answered in less than 300ms with a load of 30 req/sec" which is a *lot*
> more feasible in real life but still incredibly problematic with a N=1
> architectural vision (especially for fault-tolerance).
> 
> So, the trend is to move the higher level data management and semantics
> completely to the application level and to rely on fast, massively
> scalable and completely decentralized and self-managing key-value
> 'clouds' that expose a super simple "get/set/delete" hashtable-like API,
> even as a web service.
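The "super simple" API described above can be sketched in a few lines. The class and method names here are illustrative, not Dynamo's or CouchDB's actual interfaces; a real 'cloud' would partition keys across shared-nothing nodes behind this same surface:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a hashtable-like get/set/delete store, as an in-memory
// stand-in for one node of a key-value 'cloud'. Illustrative only.
public class KeyValueCloud {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

    // returns null if the key is absent
    public String get(String key) { return data.get(key); }

    public void set(String key, String value) { data.put(key, value); }

    public void delete(String key) { data.remove(key); }
}
```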
> 
> My bet is that we'll see a lot of such "dynamo"-like systems emerging in
> the future, more or less easy to maintain and to manage, more or less
> reliable, more or less written in widely known languages (Dynamo is
> written in Java) or powerful but largely unknown ones (CouchDB is
> written in Erlang).
> 
> The question then becomes: what about triple stores?
> 
> One of the reasons why triple stores are appealing as a data-management
> tier in a web application is that they favor a data-centric incremental
> development: it's practically impossible to know ahead of time what your
> relational data model is going to look like once your web site goes in
> production when you start prototyping it. Data-first data-management
> approaches (triple stores, key-value stores, OODBMS) are much more
> natural in following the evolution of a prototype than Structure-first
> data management approaches (current-generation SQL-based RDBMS).
> 
> But unlike OODBMS (then) and triple stores (now), key-value stores are
> the only ones that focus on delivering performance rather than on
> delivering RDBMS-like functionality.
> 
> Let's face it: RDF is nothing but the good old entity-relationship model
> (which is the base of any relational database) with URIs sprinkled on
> top. And if it was possible to 'scale' an implementation of the
> entity-relationship model without requiring you to 'compile' its
> structure/schema into the database, it would already exist.
> 
> Instead, the trend is to go with key-value pairs and map/reduce jobs
> that 'precompile' the queries, keeping N>1 firmly in mind.
> 
> [Column stores could be seen as a bunch of clever disk-I/O-influenced
> optimizations of the above... but they still feel N=1 deep in their
> souls, which concerns me]
> 
>                                   - o -
> 
> But here is what I think: key-value stores are great, simple to use and
> refreshingly scalable should the need emerge: if there were an open
> source Dynamo today, I would probably use it.
> 
> But there isn't... and I wonder: how hard would it be to adapt the
> Sesame implementation to start considering itself an N>1 application?
> Could we obtain the "95% of requests under 300ms at 30 req/sec" QoS
> with "s p ?o" queries?
> 
> I don't care if SPARQL is slow; I'll cache the results or throw new
> silicon at it... but how hard would it be to make Sesame feel as
> refreshing and comforting as Dynamo does (scalability- and QoS-wise),
> at least for the very basic data-management functionality?
> 
> Because it would be a huge win: for simple queries, it would perform
> just like a key-value store; for more complicated queries, it would be
> slower but still scalable.
> 
> Yes, I perfectly understand that distribution means graph clustering,
> minimum cuts, distributed transactions and all that tarpit that sank
> most of the advancements in RDBMS technology over the last 20 years....
> but what if SPARQL queries are "injected" ahead of time and computed
> with map/reduce jobs? And if the queries that aren't, the 'exploratory'
> ones, turn out to be slow, who cares, so be it.
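Injecting a pattern ahead of time could look roughly like this: a map/reduce-style job (sequentially simulated here) that precomputes, for one known predicate, every subject's objects, so the live query degenerates into a key lookup. Class and method names are hypothetical, and a real Hadoop job would distribute the map and reduce phases across nodes:

```java
import java.util.*;
import java.util.stream.*;

// Sketch: precomputing an "s p ?o" index for a fixed, known predicate.
// The map phase filters triples matching the injected predicate; the
// reduce phase collects the objects per subject. Illustrative only.
public class PrecomputedPattern {
    record Triple(String s, String p, String o) {}

    static Map<String, List<String>> run(List<Triple> triples, String predicate) {
        return triples.stream()
                .filter(t -> t.p().equals(predicate))          // map: keep matches
                .collect(Collectors.groupingBy(Triple::s,      // reduce: group by subject
                         Collectors.mapping(Triple::o, Collectors.toList())));
    }
}
```

The resulting map could then be pushed into a key-value store, where answering the precompiled query is a single get.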
> 
> What do you think?

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
