Hi Francis, I was talking with someone at work about Mnesia, which sounds like it's worth considering. It is distributed among N nodes with full replication, so it suits problems that need good data locality, i.e. where you do a lot of work with the data (all data lives on every node, and changes replicate everywhere quickly). For some types of data set that breaks down quite soon, of course: you pretty much want the whole dataset to fit in RAM, e.g. up to 64 GB. Mnesia takes care of replicating changes all around, of failed nodes, netsplits and syncing back after them, etc.
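To make the trade-off concrete, here's a toy sketch (in Python, not Mnesia itself, and not how Mnesia is implemented) of full replication: every write goes to every node, so any node can answer reads from local memory, but each node must hold the entire dataset, which is why RAM size caps the dataset:

```python
class Node:
    """One node in the cluster; holds a full copy of the dataset."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class ReplicatedCluster:
    """Toy fully-replicated store: writes fan out to all N nodes."""
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def write(self, key, value):
        # Every write is applied on every node.
        for node in self.nodes:
            node.store[key] = value

    def read(self, key, node_index=0):
        # Any node can answer purely from its local copy.
        return self.nodes[node_index].store.get(key)

cluster = ReplicatedCluster(["a", "b", "c"])
cluster.write("postcode:SW1A", b"...precomputed travel times...")
# Reads are local on every node, but so is the full storage cost.
assert cluster.read("postcode:SW1A", node_index=2) == b"...precomputed travel times..."
```

Reads never cross the network, which is the "do a lot with the data" win; the cost is that total capacity never exceeds one node's RAM.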
I don't know much about MongoDB or CouchDB. You may have to manage syncing yourself at the application layer, but they probably scale much further (depending on what your application does).

You could also run smaller clusters of Mnesia nodes, with application code replicating between them and duplicating frequently requested buckets across clusters, plus another, global Mnesia to hold routing information (which bucket lives where). So a combination might also make sense: Mnesia for the routing information on broker nodes, and CouchDB, Memcached or MongoDB on the storage nodes with the large blobs of tile and other precomputed data. Your application servers would pick a broker node at random, ask it where some blob is, and pass the blob through from the storage node to the client. The brokers could also increment per-object access counters and run async jobs to copy frequently accessed objects to more storage nodes.

Instead of NFS for distributing tiles, you could consider a web service running off an httpd server like nginx.

Another possibility for the entire infrastructure is Google App Engine, which utilises BigTable for fast, distributed data indexing and querying, and serves apps from a Python or Java runtime. There is a queue API, a memcache API, a simple image manipulation API, and a very good pricing model, which works out considerably cheaper than AWS for all the models I've considered. For example, CPU time is theoretically billed at the same rate in AWS and GAE, but in GAE you only pay for actual CPU time, whereas in AWS you pay for instance uptime. Of course, the price you pay for the cheapness and free scaling in GAE is lack of control, lack of customer service, and no choice of where the data is stored (but I don't think Mapumental has data privacy concerns...?). The flip side of the lack of control is that the complexity is constrained.
Personally I'm impressed by GAE and will be continuing to use it on new projects where I can, but I've not used it on a massively resource-intensive job yet. The only part of a GAE app that isn't easily portable to a new architecture is the datastore access, which can be abstracted away easily enough, so you could always choose to migrate from GAE to AWS at a later date.

Seb

2009/8/14 Francis Irving <[email protected]>:
> Mapumental is a website which shows contour maps of public transport
> travel times, house prices and other data. It's in closed beta.
>
> http://mapumental.channel4.com/
>
> It uses lots of CPU running the transport route finding for each
> postcode, and rendering the tiles as they are served.
>
> Before we can openly release it, we need to make it scale easily
> (say, on Amazon Web Services).
>
> Currently it is using:
> * A PostgreSQL database to store the points behind the static datasets,
>   such as scenicness and house prices.
> * Binary files on NFS to store the generated datasets of travel times.
>   PostgreSQL was too slow, and used too much memory, to load in the
>   large number of rows that would be required (300,000 for each
>   user-entered postcode).
> * A rendered tile cache, containing PNG files on the NFS filesystem.
> * PostgreSQL for queueing the jobs for the transport route finder.
>
> We now want to:
> * make the site scale easily (on Amazon Web Services),
> * make it easy to add more data sets.
>
> We had problems with NFS, so I need something to replace the binary
> files in NFS and the tile cache. It might also be prudent to use
> something easier to scale than a PostgreSQL database, although I
> suspect the load on it would be low, so perhaps it isn't a problem.
>
> So the new version of Mapumental that I'm currently planning has to
> store:
> a) a cache of rendered tiles (some generated rarely and accessed
>    frequently, e.g. house prices; some not accessed often compared
>    to generation times, e.g.
>    public transport route), and
> b) coordinates and values of arbitrary point datasets (e.g.
>    school quality, asthma air quality, wind speed, route by
>    car to a particular postcode, etc.)
>
> I'm looking for good, open source alternatives to NFS and PostgreSQL
> to do this. Distributed data stores and queueing systems.
>
> What should I look at? What can I trust?
>
> I've already surveyed the field, and have my own ideas about what to
> do, but would be interested if anyone here has experience of, or
> views on, any of the obvious technologies.
>
> I'd like it to be stable and mature, and realistically it would
> already be in a Debian package.
>
> Francis
>
> _______________________________________________
> Mailing list [email protected]
> Archive, settings, or unsubscribe:
> https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public

--
skype: seb.bacon  mobile: 07790 939224
