Hi Francis, I was talking with someone at work about Mnesia, which sounds like it's worth considering. It is distributed among N nodes with full replication, so it suits problems that need good data locality, i.e. where you do a lot of work with the data (all data lives on every node, and changes replicate everywhere quickly). For some types of data set that breaks down quite soon, of course: you pretty much want the whole dataset to fit in RAM, e.g. up to 64 GB. Mnesia takes care of replicating changes all around, of failed nodes, netsplits and syncing back after them, etc.
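To make the trade-off concrete, here's a toy sketch (in Python, not Mnesia itself, and not how Mnesia is implemented) of full replication: every write goes to every node, so any node can answer reads from local memory, but each node must hold the entire dataset, which is why RAM size caps the dataset:

```python
class Node:
    """One node in the cluster; holds a full copy of the dataset."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class ReplicatedCluster:
    """Toy fully-replicated store: writes fan out to all N nodes."""
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def write(self, key, value):
        # Every write is applied on every node.
        for node in self.nodes:
            node.store[key] = value

    def read(self, key, node_index=0):
        # Any node can answer purely from its local copy.
        return self.nodes[node_index].store.get(key)

cluster = ReplicatedCluster(["a", "b", "c"])
cluster.write("postcode:SW1A", b"...precomputed travel times...")
# Reads are local on every node, but so is the full storage cost.
assert cluster.read("postcode:SW1A", node_index=2) == b"...precomputed travel times..."
```

Reads never cross the network, which is the "do a lot with the data" win; the cost is that total capacity never exceeds one node's RAM.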
I don't know much about MongoDB or CouchDB. You may have to manage syncing yourself at the application layer, but they probably scale much further (depending on what your application does).

You could also run smaller clusters of Mnesia nodes, with application code replicating between them and duplicating frequently requested buckets across clusters, plus another, global Mnesia to hold routing information (which bucket lives where). So a combination might also make sense: Mnesia for the routing information on broker nodes, and CouchDB, Memcached or MongoDB on the storage nodes with the large blobs of tile and other precomputed data. Your application servers would pick a broker node at random, ask it where some blob is, and pass the blob through from the storage node to the client. The brokers could also increment per-object access counters and run async jobs to copy frequently accessed objects to more storage nodes.

Instead of NFS for distributing tiles, you could consider a web service running off an httpd server like nginx.

Another possibility for the entire infrastructure is Google App Engine, which utilises BigTable for fast, distributed data indexing and querying, and serves apps from a Python or Java runtime. There is a queue API, a memcache API, a simple image manipulation API, and a very good pricing model, which works out considerably cheaper than AWS for all the models I've considered. For example, CPU time is theoretically billed at the same rate in AWS and GAE, but in GAE you only pay for actual CPU time, whereas in AWS you pay for instance uptime. Of course, the price you pay for the cheapness and free scaling in GAE is lack of control, lack of customer service, and no choice of where the data is stored (but I don't think Mapumental has data privacy concerns...?). The flip side of the lack of control is that the complexity is constrained.
Personally I'm impressed by GAE and will be continuing to use it on new projects where I can, but I've not used it on a massively resource-intensive job yet. The only part of a GAE app that isn't easily portable to a new architecture is the datastore access, which can be abstracted away easily enough, so you could always choose to migrate from GAE to AWS at a later date.

Seb

2009/8/14 Francis Irving <[email protected]>:
> Mapumental is a website which shows contour maps of public transport
> travel times, house prices and other data. It's in closed beta.
>
> http://mapumental.channel4.com/
>
> It uses lots of CPU running the transport route finding for each
> postcode, and rendering the tiles as they are served.
>
> Before we can openly release it, we need to make it scale easily
> (say, on Amazon Web Services).
>
> Currently it is using:
> * A PostgreSQL database to store the points behind the static datasets,
>   such as scenicness and house prices.
> * Binary files on NFS to store the generated datasets of travel times.
>   PostgreSQL was too slow, and used too much memory, to load in the
>   large number of rows that would be required (300,000 for each
>   user-entered postcode).
> * A rendered tile cache, containing PNG files on the NFS filesystem.
> * PostgreSQL for queueing the jobs for the transport route finder.
>
> We now want to:
> * make the site scale easily (on Amazon Web Services),
> * make it easy to add more data sets.
>
> We had problems with NFS, so I need something to replace the binary
> files in NFS and the tile cache. It might also be prudent to use
> something easier to scale than a PostgreSQL database, although I
> suspect the load on it would be low, so perhaps it isn't a problem.
>
> So the new version of Mapumental that I'm currently planning has to
> store:
> a) a cache of rendered tiles (some generated rarely and accessed
>    frequently, e.g. house prices; some not accessed often compared
>    to generation times, e.g.
>    public transport route), and
> b) coordinates and values of arbitrary point datasets (e.g.
>    school quality, asthma air quality, wind speed, route by
>    car to a particular postcode, etc.)
>
> I'm looking for good, open source alternatives to NFS and PostgreSQL
> to do this. Distributed data stores and queueing systems.
>
> What should I look at? What can I trust?
>
> I've already surveyed the field, and have my own ideas about what to
> do, but would be interested if anyone here has experience of, or
> views on, any of the obvious technologies.
>
> I'd like it to be stable and mature, and realistically it would
> already be in a Debian package.
>
> Francis
>
> _______________________________________________
> Mailing list [email protected]
> Archive, settings, or unsubscribe:
> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public

--
skype: seb.bacon  mobile: 07790 939224
