On Thu, Nov 18, 2010 at 10:44 PM, Stuart Bishop <stuart.bis...@canonical.com> wrote:
> On Thu, Nov 18, 2010 at 10:28 AM, Robert Collins
> <robe...@robertcollins.net> wrote:
>
>> The notation I'm going to use is this:
>> 'foo' : the literal value foo.
>> foo : a variable representing foo
>> ... : repeated things.
>> + prefixing a column name : 'has a secondary index'
>> (Thing) : this row is sorted on Thing. For instance
>> 'Address':value(timestamp) - sorted on the timestamp.
>>
>> ColumnFamily (aka Table): CF|SCF (ColumnFamily or SuperColumnFamily)
>>
>> Row-Key : [+]ColumnName:(value)
>>
>> Remember too that every concrete column - essentially a stored cell -
>> has a timestamp on it.
>
> Have you found any way of diagramming systems? I'm finding this and the
> Riptano slides pretty unreadable, even for these toy examples.
I'm using a minor tweak on the presenter's preferred style; we can certainly start using SVGs/wiki pages or some such. I find it useful to remember that 'CF == table' - for all intents and purposes, except for the static nature of the rows in regular SQL databases.

> oauth - The issue in PostgreSQL is the nonce handling. This would be
> in memcache now except that we are relying on atomic commits to avoid
> race conditions in the replay avoidance stuff. Cassandra will hit the
> same issue. For nonce handling, I think memcache is a better fit -
> volatile is fine so keep it fast and avoid all that disk activity, and
> if a nonce is consumed other clients need to know that immediately
> (rather than waiting for information to replicate around).

I'd like to correct the analysis here though - a write + read @ quorum is extremely fast in Cassandra: it's a single write to the WAL on each of the quorum machines (e.g. 2), and reads will be coming from memtables so will have no IO. Reading at quorum is consistent - it's not delayed by replication, because Cassandra writes in parallel. A write at quorum, with the replication factor set to 3, does the following:

 - the coordinator (the node the client is attached to) determines the nodes the row should end up on
 - the coordinator sends all such nodes (3) the request in parallel
 - the coordinator waits for a quorum (2) of nodes to reply
 - the coordinator responds to the client

The recommended setup has a dedicated disk for the WAL - so all IOs are linear, with no seeking, and the change is guaranteed once it's in the WAL. Reads then do the following:

 - the coordinator (the node the client is attached to) determines the nodes the row should come from
 - the coordinator sends all relevant (3) nodes the request in parallel
 - the coordinator waits for a quorum (2) of nodes to reply
 - the coordinator checks that both replied with the same value, and if they didn't it waits for more replies to see what the real value is
 - the coordinator responds to the client

So in the oauth nonce case, we'd expect to be able to answer accurately mere milliseconds apart. Also, in terms of scaling, we can have N nodes, where N could be hundreds of machines, all cooperating to handle huge load, whereas with memcache, to cope with more than one machine's volume we'd need a sharding proxy in front of it.

With that out of the way, what precisely does the nonce handling need to do? AIUI we want reliable, transient storage for issuing a new nonce on every API request? What happens if the nonce store were to get reset without warning? Would all current API clients handle it transparently, or would they all barf? If they'd all barf, I think we probably want to do something to avoid that - the support / confidence impact of that sort of disruption is (IMO) pretty significant.
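To make that cost concrete, here's roughly what a replay check could look like from the appserver with quorum reads and writes. This is a sketch only - the keyspace, the 'OAuthNonce' column family, the row key layout and the pycassa usage are all assumptions, nothing like it exists today - and note it's a read followed by a write, not an atomic check-and-set:

    import time

    import pycassa
    from pycassa import ConsistencyLevel, NotFoundException

    pool = pycassa.ConnectionPool('Launchpad', server_list=['cass1:9160'])
    nonces = pycassa.ColumnFamily(
        pool, 'OAuthNonce',
        read_consistency_level=ConsistencyLevel.QUORUM,
        write_consistency_level=ConsistencyLevel.QUORUM)

    def check_and_record_nonce(consumer_key, token, nonce):
        """Return False if this nonce was already seen, else record it.

        Each call blocks until a quorum (2 of 3) of replicas answer - the
        round trip described above. The gap between the get and the insert
        is still a (tiny) replay window; Cassandra has no atomic
        check-and-set.
        """
        row_key = '%s/%s' % (consumer_key, token)
        try:
            nonces.get(row_key, columns=[nonce])
            return False  # already consumed
        except NotFoundException:
            nonces.insert(row_key, {nonce: str(time.time())})
            return True

Keying the row per consumer/token (rather than one row per nonce) is just an illustrative choice here - it keeps any one row down to a single client's nonce history.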
> sessions - seems a decent fit. I'm not sure if the existing setup is a
> problem that needs solving though.

We wanted to pick some good-sized things to think through; I think sessions are working OK right now.

> memcache - Using memcache is essentially free because of its
> limitations. I don't think Cassandra is a suitable replacement for our
> current volatile-data-only usage of memcache. There have been some
> things we decided memcache was not suitable for that Cassandra could be
> a better fit for.

Memcache isn't totally free - we spend /seconds/ chatting to it on some pages. I need to do some rigorous testing, but the data I have so far is that bugtask:+index is slowed down more than it is sped up by memcache.

> Is it suitable for replacing the bulk of the Librarian?

We could; OTOH there is a SAN coming which will give our current implementation legs.

> Disaster recovery will be an issue. We need things in place before we
> put any data we care about into it.

Absolutely; James Troup and Michael Barnett seemed reasonably confident about the options there. IIRC there are basically several stories. Firstly, remaining available can be done in a variety of ways, all of which are variations on: run multiple data centres with a network-aware topology, request datacentre-local consistency on writes and reads, and trust that the other data centre won't be far behind. Secondly, backups - there is a near-instant per-node backup facility in Cassandra: it takes a hardlink of the current data, and you back that up at your leisure. Lastly, replicating into a new cluster would be something like Hadoop - a massive map across everything. I'm going to write to mdennis to explore this a little more for the staging and qastaging questions.

> Staging and qa systems will be interesting. I'm not sure how things
> could be integrated. I guess we would need to build a staging
> cassandra database from a snapshot taken after the PostgreSQL dump was
> taken, with missing data being ok because of 'eventually consistent'.

That's my understanding.

> I don't see a win in replacing small systems that are not in trouble.
> We may just as easily avoid the trouble by redesigning for PG or
> memcache than by redesigning for Cassandra. Adding another moving
> part like Cassandra introduces a lot of moving parts - too much
> overhead for the toy systems. If we want to use it, I'd want to see it
> used for a big system that could do with a performance boost.
> Publishing history in soyuz, Branch/BranchRevision/Revision in
> codehosting, *Message/Message/MessageChunk,
> LibraryFileAlias/LibraryFileContent, full text search, karma.

That's a good point too. I think I touched on integration, but it's fairly straightforward: to query across PG and Cassandra, we would have the PG row ids in Cassandra - either as row keys or as a dedicated index to the natural keys. Then we'd query as much as we can in PG, issue a multiget or similar to get things from Cassandra, and reduce in the appserver. This obviously works poorly if we'd get back thousands of rows to filter / sort from PG - so we'd need to avoid that. There are a couple of strategies for doing that.

One is to have some of the tables present in both PG and Cassandra. Then when Cassandra can answer more efficiently, query Cassandra; when PG can answer more efficiently, query PG. For instance, we might want to do set membership tests for BranchRevision in Cassandra, but reductions when querying for bugs with branches in PG. If we like the result we would migrate more and more, until nothing needs to query a given table in PG, at which point we could drop it. We could use this without migrating, in fact - just add the moving parts and get a hybrid doing different bits really well.

Another is to migrate clusters of tables that are queried together, where the boundaries have small numbers of ids that we'd need to cross-query on. We could move LibraryFileContent for instance - we rarely need to query it, even though we access LFA all the time. And generally when accessing LFA we do so as part of branding many people - we could move that and multiget it quite satisfactorily.
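As an illustration of that cross-store pattern (query PG for ids, multiget from Cassandra, reduce in the appserver), here's a rough sketch. The 'LibraryFileContent' column family keyed by the PG LibraryFileContent.id, the keyspace name and the query are all assumptions for the sake of the example; the PG side is a plain DB-API cursor:

    import pycassa

    pool = pycassa.ConnectionPool('Launchpad', server_list=['cass1:9160'])
    # Hypothetical column family keyed by the PostgreSQL LibraryFileContent.id.
    contents = pycassa.ColumnFamily(pool, 'LibraryFileContent')

    def contents_for_aliases(cursor, alias_ids):
        """Resolve LibraryFileAlias ids to content rows held in Cassandra."""
        # 1. Do the relational work (joins, filtering, ordering) in PostgreSQL,
        #    pulling back only the ids we need to cross over with.
        cursor.execute(
            "SELECT content FROM LibraryFileAlias WHERE id = ANY(%s)",
            (list(alias_ids),))
        content_ids = [str(row[0]) for row in cursor.fetchall()]

        # 2. One multiget round trip to Cassandra for all of those rows.
        rows = contents.multiget(content_ids)

        # 3. Reduce / reassemble in the appserver, preserving the PG order
        #    (multiget omits keys that have no row).
        return [rows[key] for key in content_ids if key in rows]

The point is that the only thing crossing the boundary is a bounded list of ids, so as long as PG keeps the result set small the extra round trip stays cheap.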
-Rob