Hi Thorsten
Thorsten Möller wrote:
On 09.09.2011, at 00:20, Paolo Castagna wrote:
Hi,
one feature I miss in Fuseki (as probably does everybody who wants to use Fuseki to run
SPARQL endpoints, public or private) is "high availability".
One way to increase the availability of your SPARQL endpoints is to use
"replication": you serve the same data from multiple machines and you put a
load balancer in front of them, distributing queries across those machines.
I'd like something relatively easy to set up, with close to zero admin costs at
runtime, and which would make it extremely easy to provision a new replica if
necessary.
I'd like something so easy, I could implement it in a couple of days.
Here are some thoughts (inspired by Solr's master/slave architecture and
replication)...
One master, N slaves. All running exactly the same Fuseki code base.
All writes go to the master. If the master is down, no writes are possible
(this is the price to pay in exchange for extreme simplicity).
All reads go to the N slaves (+ master?). Slaves might be out of sync with each
other and with the master (this is the price to pay in exchange for extreme simplicity).
Every once in a while (configurable), each slave will contact the master to
check whether it has received updates since the last time the slave checked.
An additional servlet needs to be added to Fuseki to co-ordinate replication.
Replication happens simply by running rsync (from Java), with the master
pushing its files out to a remote slave.
During replication the master must acquire an exclusive write lock, sync the TDB
indexes to disk, rsync with the slave, and release the lock when finished or after a
timeout long enough for a first sync to succeed.
Slaves need to drop any TDB caches after an rsync.
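For the master-side push, a rough sketch in Java could look like the code below (the
class, method and parameter names are made up; the pre-Apache com.hp.hpl.jena packages
are assumed, and error handling and the lock timeout are left out):

    import java.io.IOException;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.shared.Lock;
    import com.hp.hpl.jena.tdb.TDB;

    public class RsyncPush {
        // Push the on-disk TDB files of 'dataset' (stored in 'tdbDir') to one slave,
        // holding the dataset write lock so that rsync sees a consistent copy.
        public static void pushToSlave(Dataset dataset, String tdbDir, String slaveTarget)
                throws IOException, InterruptedException {
            dataset.getLock().enterCriticalSection(Lock.WRITE);    // block writers
            try {
                TDB.sync(dataset);                                 // flush TDB indexes to disk
                Process rsync = new ProcessBuilder(
                        "rsync", "-az", "--delete",
                        tdbDir + "/", slaveTarget)                 // e.g. "slave1:/data/tdb/"
                        .start();
                rsync.waitFor();                                   // in practice, bounded by a timeout
            } finally {
                dataset.getLock().leaveCriticalSection();          // writes can resume
            }
        }
    }

On the slave side, once the rsync has finished, the TDB caches would need to be dropped
before serving queries from the updated files.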
The things I like about this approach are:
- it's very simple
- after the first time, rsync is incremental and quite fast
- provisioning a new slave is trivial
- it could be implemented relatively quickly
- if the master receives an update while one or more slaves are down, slaves
will catch up later
The things I do not like are:
- the master is a SPOF for writes
- you need an external load balancer to send requests to the slave(s)
- slaves can be out of sync for a period of time
- can a slave answer queries during the replication with rsync?
What do you think?
That's all well thought out, but it's more an approach to scalability than to availability. The reason is that the single master for writes remains a SPOF; hence, availability does not change with regard to writes. Of course, one could assume that the master has 100% availability, but then there would be no need to do anything about availability in the first place.
You are right, this does not improve availability (nor scalability) for writes.
But you could have (as we do) use cases where you are in control of the writes
and you do not let your users (or not all of your users) issue SPARQL Updates.
In this case, your users will not experience problems even if your master is
down.
You need to queue updates and retry later if the master is not available.
In a lot of cases this would be perfectly fine.
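As a sketch of what that queuing could look like (the master endpoint URL and class
names are made up; a real implementation would persist the queue and cap the retries):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class UpdateQueue {
        // Hypothetical update endpoint of the single write master.
        private static final String MASTER = "http://master:3030/dataset/update";

        private final BlockingQueue<String> pending = new LinkedBlockingQueue<String>();

        // Accept the write locally, even if the master is currently down.
        public void submit(String sparqlUpdate) {
            pending.add(sparqlUpdate);
        }

        // Called periodically: replay queued updates, in order, against the master.
        public void drain() {
            String update;
            while ((update = pending.peek()) != null) {
                if (!post(update)) {
                    return;              // master unavailable: keep the update, retry later
                }
                pending.remove();        // delivered, drop it from the queue
            }
        }

        // POST one SPARQL Update to the master (plain SPARQL 1.1 protocol).
        private boolean post(String update) {
            try {
                HttpURLConnection c = (HttpURLConnection) new URL(MASTER).openConnection();
                c.setRequestMethod("POST");
                c.setRequestProperty("Content-Type", "application/sparql-update");
                c.setDoOutput(true);
                OutputStream out = c.getOutputStream();
                out.write(update.getBytes("UTF-8"));
                out.close();
                int status = c.getResponseCode();
                return status >= 200 && status < 300;
            } catch (Exception e) {
                return false;            // treat any failure as "master not available"
            }
        }
    }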
I don't have an alternative which would be as simple to implement and which would
increase the availability (in particular for reads) of a bunch of machines
running Fuseki.
Replication is inherently a tradeoff between consistency and performance; recall
the famous CAP theorem [1].
Yep.
Replication is good at increasing your throughput in terms of queries/sec, and it can
have a minor effect in reducing query response time because you have fewer
concurrent queries per machine.
Caching will also be very useful, and this is by no means a substitute for it.
We would love to have both! The two things can and, IMHO, should coexist.
In my opinion, availability should come first and caching soon after.
The problem with having caching first and no availability is cache misses:
your users don't get their responses back.
Another approach would be to look at the Journal in TxTDB and see if we could
have a Journal which would replicate changes to remote machines.
Another alternative would be to submit a SPARQL Update or a write request to
each of the replicas.
This seems to be a very interesting alternative, since it would basically do away with
data replication altogether. Contemplating it a little more: a small update query can, in
the worst case, lead to the need to replicate the entire data set (e.g., delete
everything), whereas the cost of replicating a query of a few hundred bytes is low. In
other words, the cost of replicating queries does not vary much, whereas the cost of
replicating the data changed as a result of a query can vary dramatically. If you further
assume eventual consistency [2] (i.e., all nodes eventually have the same state of the
data), then it would also not matter how fast the nodes are in processing the queries.
You would "only" need a global update query ordering to ensure that all nodes eventually
reach the same state; otherwise, if nodes receive update queries in a different order,
they diverge in the state of the data over time. Note that readers might, of course, see
different results depending on which node they query - that's the price of eventual
consistency.
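For illustration, the fan-out itself could be as simple as the sketch below (the endpoint
URLs are made up; there is no ordering and no retry logic - precisely the hard parts
being discussed here):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FanOutUpdate {
        // Hypothetical replica update endpoints.
        static final List<String> REPLICAS = Arrays.asList(
                "http://node1:3030/ds/update",
                "http://node2:3030/ds/update",
                "http://node3:3030/ds/update");

        // POST the same SPARQL Update to every replica; return the endpoints that did
        // not acknowledge it (the "replica is down" case raised elsewhere in this thread).
        static List<String> fanOut(String sparqlUpdate) {
            List<String> failed = new ArrayList<String>();
            for (String endpoint : REPLICAS) {
                try {
                    HttpURLConnection c = (HttpURLConnection) new URL(endpoint).openConnection();
                    c.setRequestMethod("POST");
                    c.setRequestProperty("Content-Type", "application/sparql-update");
                    c.setDoOutput(true);
                    OutputStream out = c.getOutputStream();
                    out.write(sparqlUpdate.getBytes("UTF-8"));
                    out.close();
                    if (c.getResponseCode() >= 300) failed.add(endpoint);
                } catch (Exception e) {
                    failed.add(endpoint);   // unreachable replica has to catch up somehow
                }
            }
            return failed;
        }
    }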
The 'difficult' bit here would be the need for a "global update query ordering".
Guess what we do at Talis? https://github.com/talis/H1 ;-)
However, you need to manage additional services, including a place to store updates
so that they can be made available to other nodes, in case a node is not available
or something goes wrong when you try to apply an update.
Also, in this scenario the provisioning of a new slave requires some "trivial"
intervention (i.e. copying the TDB indexes while the master is in read-only mode).
I completely agree with you that reissuing queries is cheaper in terms of
network transfer, in particular if a lot of data changes per update.
However, I like the extreme simplicity of the rsync approach and the fact that
the master does not need to know who the slaves are, how many there are, or whether
they are available at the time it needs to apply an update.
No need for a global update query ordering, a queuing system, or a repository for
updates.
I am just sharing some thoughts to explore the pros/cons of different approaches.
To me, any kind of replication mechanism would be better than none at all or
than letting users design their own.
Paolo
Thorsten
[1] http://en.wikipedia.org/wiki/CAP_theorem
[2] http://en.wikipedia.org/wiki/Eventual_consistency
In this case, what do we do if one of the replicas is down?
Other alternatives?
Is this something you need or you would use?
Paolo