On May 2, 3:49 pm, Murali Krishna <[email protected]> wrote:
> Hello,
>
> I am writing an application to store some crucial data using datastore
> api. I cannot afford to loose not even single record of possible
> 10million records. Does Google promise that the data be stored without
> any loss? Do they mention this fact in the terms and conditions?

http://groups.google.com/group/google-appengine/msg/8a9a505e8aaee07a
"The short story is: We won't lose your data - we have a robust backup
and
recovery strategy"

Backup on AppEngine seems to be a big topic, as feature requests (e.g.
http://is.gd/fQcfXM http://is.gd/KTp65k#) and mailing list discussions
(e.g. http://is.gd/6ObYrY http://is.gd/vo7TFj#) show.

Regulatory requirements force many companies to have a written plan for
disaster recovery. "Google just doesn't lose data" can't be the answer
here.

Below is my evaluation of the situation and solutions along with code to
replicate datastore contents to Amazon S3:

## Disaster Recovery on AppEngine and Off-Site Replication

# In the following paragraphs we consider several disaster scenarios and how
# to guard against them. Here we are only considering safety (availability)
# issues, not security (confidentiality) issues. Our data is hosted on Google
# AppEngine servers which seem to be exclusively controlled by Google Inc. and
# exclusively hosted in the United States. This contributes to some disaster
# recovery scenarios.

# 1. Due to some programming or administration error on our side data is
#    wiped out.
# 2. Due to some programming or administration error on Google's side data is
#    wiped out. Data may or may not be restored by Google after some time
#    (see "[unapplied writes][1]" in 2010 or the AWS EBS outage in 2011).
# 3. Due to some third-party software or hardware involvement data is wiped
#    out. Think of student or coordinated physical attacks on datacenters.
# 4. Due to some contractual problems (e.g. we don't pay) data is deliberately
#    wiped out by Google.
# 5. The US government or US court system decides to deny us access to our
#    data.
# 6. A disgruntled admin decides to delete our data.
#
# In addition there are some desirable properties the replication should have:
#
# 1. One copy of the data must be stored within the EU. This is a requirement
#    for tax record keeping.
# 2. One copy of the data should be stored within Germany. This makes tax
#    record keeping easier.
# 3. One copy of the data should be stored on site. This would ensure
#    availability even if our company can't pay any 3rd parties for storage
#    for some time.
# 4. The replication should involve only minimal administrative resources.
#    I always keep this image in mind when designing that stuff:
#    http://static.23.nu/md/Pictures/ZZ573344DB.png
#    Especially I want to avoid cronjobs on Unix machines which need
#    additional monitoring, patching, upgrading, disks, backups, etc.
#    If possible everything should run on AppEngine.

# One thing the replicas don't need to provide is immediate access. As long as
# the data and metadata are replicated somewhere and can be loaded into an
# application (possibly written on demand) we are fine. Several of the
# scenarios above imply that we would not have access to AppEngine
# infrastructure and must rewrite our software anyhow. So direct restore from
# the replicas is not needed.

# We decided not to use the [bulkloader][3] for backups. While the bulkloader
# is a fine piece of software, it seems to me that it can't be used for
# incremental backups. Also I'm reluctant to enable the `remote_api` because
# technically this would enable every developer with admin permissions on the
# AppEngine application to download our complete dataset "for testing". And
# then a laptop gets lost/stolen ...

# I also would argue that an application with enabled `remote_api` can't
# comply with any serious audit/bookkeeping standards. So we don't use it.

# Currently we have no objects bigger than 1 MB. This will change when we use
# the blobstore. Replicating entities bigger than 1 MB will be challenging
# since the `urlfetch` API only allows 1 MB per upload. Options for storage we
# considered:

### Amazon S3 (Simple Storage Service)

# This was our first choice. It provides storage in the EU, is well known and
# well regarded, and comes with a rich ecosystem. With [S3 multipart
# upload][4] it would be possible to generate big files, but unfortunately
# the part size must be 5 MB or more while with the urlfetch API we can write
# only 1 MB or less. So this doesn't work. But for objects < 1 MB Amazon S3
# is a fine choice.
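# The mismatch between the two limits can be made explicit in a small
# guard. A sketch (the constant and function names are ours, not part of
# the application):

```python
# Upload size limits in bytes. These mirror the limits discussed above:
# urlfetch uploads are capped at 1 MB, while S3 multipart parts (except
# the last one) must be at least 5 MB.
URLFETCH_MAX_UPLOAD = 1 * 1024 * 1024
S3_MULTIPART_MIN_PART = 5 * 1024 * 1024


def can_replicate_directly(size):
    """Return True if an object of `size` bytes fits into a single
    urlfetch upload to S3."""
    return size <= URLFETCH_MAX_UPLOAD


# Multipart upload cannot bridge the gap: every part we could send from
# AppEngine would be smaller than the smallest part S3 accepts.
assert URLFETCH_MAX_UPLOAD < S3_MULTIPART_MIN_PART
```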

### GS (Google Storage) and the AppEngine blobstore

# Both services don't guard against most of the disaster scenarios described
# above, but still have some interesting properties which might make them
# desirable as an immediate step to generating replicas.

# With [resumable uploads][5] Google Storage provides the ability to generate
# very large files while still being bound to the 1 MB upload limit of the
# urlfetch API. In theory I should also be able to use the new
# [file-like blobstore access][6] to write large files to the blobstore. But
# there seems to be a [30 second time limit][7] on keeping the file open and
# it seems to be impossible to open an existing file in append mode.

# For now we don't use Google Storage and the AppEngine blobstore because
# they don't guard against most of our disaster scenarios.

### Dropbox and box.net

# We use Dropbox extensively and have plenty of storage with a "Team" account.
# We already use Dropbox for file input and output to AppEngine applications.
# Dropbox provides not only online storage but also syncing to local machines.
# Installing a Dropbox client on a local machine would provide us with on-site
# storage with minimal administrative hassle. Off-site storage within the EU
# would be more work.

# Unfortunately the [public Dropbox API][8] does not provide an append-to-file
# operation or anything else to create files bigger than 1 MB from AppEngine.
# The ecosystem of Dropbox Python libraries seems somewhat immature.

# I haven't looked too closely into box.net, but [the box.net upload API][9]
# seems to have more or less the same limitations as Dropbox.

# I also took a quick look into Windows Azure Storage but I didn't understand
# if and how I can use only the storage service.

### Rackspace Cloudfiles

# Cloudfiles is a storage offering provided by Rackspace in the United States
# but also [in the United Kingdom by rackspace.co.uk][10]. And the United
# Kingdom is (mostly) in the EU. Rackspace is pushing the Cloudfiles API with
# the "OpenStack" initiative but there still seems to be no extensive
# ecosystem around the API. What is strange is the fact that Rackspace US and
# Rackspace UK seem to have no unified billing and the like.

# The API would allow creating large files via "[Large Object Creation /
# segmented upload][11]". To my understanding the `PUT` method together with
# byte ranges would provide a way to append to a file, and the Python library
# for Cloudfiles already [provides generator based upload][12], but it seems
# to me the current implementation would not work this way on AppEngine.

### WebDAV (self hosted)

# WebDAV would allow using `PUT` with byte ranges and therefore would allow us
# to generate arbitrarily large output files. But I found no ready-to-use
# Python library supporting that and no hosted WebDAV provider offering cloud
# scale and cloud pricing. I want to avoid self-hosted servers.

## Choice of technology

# Currently the only services we felt comfortable with based on technology and
# pricing were Amazon S3 and Rackspace Cloudfiles. Cloudfiles has the better
# API for our requirements since it would allow us to create very large files.
# Amazon S3 has a much richer ecosystem of desktop and browser based
# utilities, FUSE filesystems etc. Therefore we decided for now to focus on
# output smaller than 1 MB and start with using Amazon S3. We would use one of
# the many desktop utilities to regularly sync data from Amazon S3 to local
# on-site storage. (one Unix cronjob to monitor `:-(` )

# This approach will guard against all the disaster scenarios described above.
# It should also guard against most data corruption scenarios because most of
# our data structures are designed to be immutable: data is never rewritten,
# instead a new version of the object is created in the datastore. The old
# version is kept for audit purposes.
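# The versioning pattern can be sketched in isolation (plain Python, not
# our datastore code; all names here are illustrative):

```python
class VersionedStore(object):
    """Append-only store: an update writes a new version instead of
    rewriting the record, so old versions stay available for audit."""

    def __init__(self):
        self._versions = {}  # key -> list of values, oldest first

    def put(self, key, value):
        # Never overwrite - always append a new version.
        self._versions.setdefault(key, []).append(value)

    def get(self, key):
        # The newest version is the current one.
        return self._versions[key][-1]

    def history(self, key):
        # Every version ever written, kept for audit purposes.
        return list(self._versions[key])
```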

# [1]: http://groups.google.com/group/google-appengine-downtime-notify/msg/e9414ee6493da6fb
# [3]: http://code.google.com/appengine/docs/python/tools/uploadingdata.html#Downloading_and_Uploading_All_Data
# [4]: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?uploadobjusingmpu.html
# [5]: http://code.google.com/apis/storage/docs/developer-guide.html#resumable
# [6]: http://code.google.com/appengine/docs/python/blobstore/overview.html#Writing_Files_to_the_Blobstore
# [7]: http://groups.google.com/group/google-appengine-python/browse_thread/thread/7c52e9397fb88ac7
# [8]: https://www.dropbox.com/developers
# [9]: http://developers.box.net/w/page/12923951/ApiFunction_Upload-and-Download
# [10]: http://www.rackspace.co.uk/cloud-hosting/cloud-files/
# [11]: http://docs.rackspacecloud.com/files/api/v1/cf-devguide-20110420.pdf
# [12]: https://github.com/rackspace/python-cloudfiles/blob/master/cloudfiles/storage_object.py#L429

## Implementation

# We want incremental replication running at fixed intervals (e.g. every 15
# minutes). All data changed since the last replication should be written to
# the external storage. All of our datastore entities have an `updated_at`
# property that is set on entity update and creation. We use this to select
# entities for replication.
# Data is stored as PDF and JSON to Amazon S3.
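# The selection boils down to a high-water-mark pattern. A minimal sketch
# using plain dicts (the real code below reads the mark from
# `DateTimeConfig` and queries the datastore instead):

```python
def select_for_replication(entities, watermark, batchsize=250):
    """Return up to `batchsize` entities with updated_at >= watermark
    (ordered by updated_at) and the new watermark.

    Using >= instead of > means the last replicated entity is picked
    up again - a small waste, but safe should several entities share
    one timestamp."""
    due = sorted((e for e in entities if e['updated_at'] >= watermark),
                 key=lambda e: e['updated_at'])[:batchsize]
    if due:
        watermark = max(e['updated_at'] for e in due)
    return due, watermark
```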

# The `boto` library has the tendency to flood the AppEngine log with
# usually uninteresting stuff, therefore we suppress its logging.
logging.getLogger('boto').setLevel(logging.CRITICAL)

# To remember which entity was the last one replicated we use a separate
# simple model storing only a timestamp.

class DateTimeConfig(db.Model):
    data = db.DateTimeProperty()

# We also use a model for storing AWS access credentials.
# You need to set the credentials before first use like this:

#     StrConfig.get_or_insert('aws_key_%s' % tenant, data='*key*')
#     StrConfig.get_or_insert('aws_secret_%s' % tenant, data='*secret*')

class StrConfig(db.Model):
    data = db.StringProperty(default='')

# The replication works incrementally and is expected to be called at
# regular intervals. Here we are only replicating Document entities and
# related PDFs for a single tenant.
# The replication stops after 300 seconds by default. This is to make sure
# that there are no race conditions between a long running replication job
# and the next job started by cron. Such a race condition should not result
# in data loss but is a waste of resources.
# We start off at the `updated_at` timestamp of the last entity successfully
# replicated, taken from `DateTimeConfig`. Alternatively the caller can
# provide a start timestamp. During the first run we start by default at
# 2010-11-01.

def replicate_documents_s3(tenant, timelimit=300, startfrom=None):
    if not startfrom:
        startfrom = DateTimeConfig.get_or_insert(
            'export_latest_%s' % tenant,
            data=datetime.datetime(2010, 11, 1)).data
    starttime = time.time()
    maxdate = startfrom

    # The connection to S3 is set up based on credentials found in the
    # datastore. If not set we use default values which can be updated via
    # the datastore admin.
    # We assume that when connecting from GAE to AWS, using SSL/TLS
    # is of little use.

    s3connection = boto.connect_s3(
        StrConfig.get_or_insert('aws_key_%s' % tenant, data='*key*').data,
        StrConfig.get_or_insert('aws_secret_%s' % tenant, data='*secret*').data,
        is_secure=False)

    # We create one bucket per month per tenant for replication. While there
    # seems to be no practical limit on how many keys can be stored in an S3
    # bucket, most frontend tools get very slow with more than a few thousand
    # keys per bucket. We currently have less than 50,000 entities per month.
    # If we had more entities we would probably be better off creating
    # an S3 bucket per day.
    bucket_name = 'dablageexport.%s-%s' % (tenant, startfrom.strftime('%Y-%m'))
    try:
        s3connection.create_bucket(bucket_name, location=Location.EU)
    except boto.exception.S3CreateError:
        logging.info("S3 bucket %s already exists", bucket_name)

    # We start replicating from the `updated_at` timestamp of the last entity
    # replicated. This results in the last entity being replicated twice.
    # While this is a certain waste of resources, it ensures that the system
    # reliably replicates even if two entities have exactly the same
    # timestamp (which should never happen due to the sub-millisecond
    # resolution of the timestamp, but you never know).

    logging.info("archiving starting from %s to %s", startfrom, bucket_name)
    docs = Dokument.all().filter('tenant =', tenant).filter(
        'updated_at >=', startfrom).order('updated_at')

    # The first version of this code used iterator access to loop over the
    # documents. Unfortunately we saw timeouts and strange errors after about
    # 80 seconds. Using `fetch()` removed that issue and also nicely limits
    # the number of documents archived per call.
    # This approach would fail if we had 250 or more entities with exactly
    # the same timestamp. We just hope this will not happen.
    for doc in docs.fetch(250):
        # Prepare filenames.
        # We want no slashes and no colons in filenames. Unfortunately
        # we can't use `strftime()` because that would lose microseconds.
        # Therefore we use `isoformat()` and `replace()`.
        # Finally we ensure the filename is not Unicode.

        # Since we use the `updated_at` property in the filename, rewritten
        # versions of the entity will get a different filename.
        akte_name = '-'.join(doc.akte.key().id_or_name().split('-')[1:])
        updated_at = doc.updated_at.isoformat().replace('-', '').replace(':', '')
        fname = "%s-%s-%s" % (akte_name, doc.designator, updated_at)
        fname = fname.encode('ascii', 'replace')
        # Serialize the entity as JSON. We use a separate file per entity
        # which makes the code much simpler.
        jsondata = hujson.dumps(doc.as_dict())
        s3bucket = s3connection.get_bucket(bucket_name)
        s3bucket.new_key(fname + '.json').set_contents_from_string(jsondata)
        # For every entity we have a separate entity containing a PDF.
        # Retrieve that and store it to S3.
        # When we reused the same s3bucket instance as for writing the JSON,
        # files got mixed up on the server. Creating a separate bucket
        # instance solved that issue.
        # Since we use the designator in the filename and the designator is
        # in fact the SHA-1 of the PDF, rewritten PDFs will get a different
        # filename.
        pdf = DokumentFile.get_by_key_name("%s-%s" % (tenant, doc.designator)).data
        s3bucket = s3connection.get_bucket(bucket_name)
        s3bucket.new_key(fname + '.pdf').set_contents_from_string(pdf)
        # Remember the date of the newest `updated_at` value.
        maxdate = max(maxdate, doc.updated_at)
        # If we have been running for more than `timelimit` seconds, stop
        # replication.
        if time.time() - starttime > timelimit:
            break

    # Finally store `maxdate` into the datastore so we know where we should
    # continue next time we are called.
    DateTimeConfig(key_name='export_latest_%s' % tenant, data=maxdate).put()
    return maxdate


# HTTP request handler to be called via cron or via a task queue.
# Being called via a regular request would impose the 30 second request
# runtime limit, which is undesirable. Running from a Task Queue handler
# or from cron gives us 10 minutes of runtime.
# Currently we use a hardcoded tenant and call this handler every 10 minutes
# via cron.

class TaskReplicateDocumentsHandler(BasicHandler):
    def get(self):
        tenant = 'hudora.de'
        maxdate = replicate_documents_s3(tenant)
        logging.info("processed up to %s", maxdate)
        self.response.out.write(str(maxdate) + '\n')
    post = get
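# On AppEngine the 10 minute schedule is declared in `cron.yaml`. A
# sketch, assuming `/automation/task/export` (the URL used for the task
# queue below) is routed to the handler above; the description text is
# ours:

```yaml
cron:
- description: replicate documents to Amazon S3
  url: /automation/task/export
  schedule: every 10 minutes
```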


# HTTP request handler to trigger replication via the task queue. For
# testing purposes.

class ExportDocumentsHandler(BasicHandler):
    def get(self):
        taskqueue.add(url='/automation/task/export', method='GET')
        self.response.out.write('ok\n')

## Evaluation

# We assume that the TaskReplicateDocumentsHandler is called every 10 minutes.
# We also assume that no more than 250 documents are added to the
# application in any 10 minute window.

# This setup will ensure that as long as Google AppEngine and Amazon S3 are
# available, all data older than 20 minutes will exist on the S3 replica
# within the EU. Should Google AppEngine or Amazon S3 become unavailable
# for some period of time and then become available again, replication will
# automatically "catch up". No data is ever deleted by the replication
# process, so data loss in the primary system will not result in data
# loss in the replica. Amazon claims 99.999999999% data durability.
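# The 20 minute figure follows from a worst case of missing one run and
# then waiting for the next run to complete. A back-of-the-envelope check
# (pure arithmetic, no AppEngine dependencies):

```python
CRON_INTERVAL_MINUTES = 10  # the handler runs every 10 minutes

# Worst case: an entity is written just after a replication run has
# selected its batch. The next run, up to one interval later, picks it
# up, and that run itself may take up to one more interval to finish.
worst_case_lag_minutes = 2 * CRON_INTERVAL_MINUTES

assert worst_case_lag_minutes == 20
```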

# Regular mirroring of the S3 data to on-site local storage will ensure
# a third replica within Germany.

# From this third replica, copies are created at regular intervals to
# removable storage media. This media is stored in a safe at a different
# office.

# This strategy will protect against all threat scenarios outlined above.
