Hey there

If you want to store megabytes of JSON in the datastore
and get it back from the datastore into Python already parsed,
this post is for you.

I ran a couple of performance tests for a case where I want to
store a 4 MB JSON object in the datastore and then get it back
at a later point and process it.

There are several ways to do this.

*Challenge 1) Serialization*
You need to serialize your data, and several
libraries can do this.
JSON objects can be serialized using
the json lib, the cPickle lib or the marshal lib.
(These are the libraries I'm aware of at the moment.)
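
As a rough illustration of the three options, the sketch below round-trips the same JSON-style dict through each library. It is written for Python 3, where cPickle's functionality lives in the `pickle` module. The sample data and the helper names are made up for illustration, and the pickle protocol is an assumption, since the post does not say which one was used.

```python
import json
import marshal
import pickle  # Python 3 name; on Python 2 this would be cPickle

# Hypothetical sample data using only JSON-compatible types.
sample = {"user": "alice", "scores": [1, 2, 3], "active": True, "ratio": 0.5, "note": None}

def roundtrip_json(obj):
    """Serialize to a JSON string and parse it back."""
    return json.loads(json.dumps(obj))

def roundtrip_pickle(obj):
    """Serialize with pickle (protocol choice is an assumption) and load it back."""
    return pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

def roundtrip_marshal(obj):
    """Serialize with marshal format version 2 and load it back."""
    return marshal.loads(marshal.dumps(obj, 2))

for name, fn in [("json", roundtrip_json),
                 ("pickle", roundtrip_pickle),
                 ("marshal", roundtrip_marshal)]:
    # Each library should hand back an object equal to the input.
    print(name, fn(sample) == sample)
```

All three return an equal object for plain JSON data, so the choice between them comes down to speed and output size.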

*Challenge 2) Compression*
If your serialized data doesn't fit into 1 MB, you need
to shard it over multiple datastore entities and
manually put it back together when loading the entities.
If you compress your serialized data before storing it,
you pay the cost of compression and decompression,
but you have to fetch fewer datastore entities when you
load your data, and you have to write fewer
datastore entities when you update your data, if it is
sharded.
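
The sharding step could look roughly like the sketch below. It only shows splitting and rejoining the bytes; writing each chunk to its own entity is left out. The function names and the chunk size are assumptions for illustration, with the chunk size picked to stay a bit under the 1 MB limit.

```python
MAX_SHARD_BYTES = 1000000  # assumed chunk size, a bit under the 1 MB entity limit

def split_blob(blob, shard_size=MAX_SHARD_BYTES):
    """Split a byte string into a list of chunks of at most shard_size bytes."""
    return [blob[i:i + shard_size] for i in range(0, len(blob), shard_size)]

def join_shards(shards):
    """Reassemble the original byte string from its chunks, in order."""
    return b"".join(shards)

blob = b"x" * 2500000  # stand-in for a serialized 2.5 MB payload
shards = split_blob(blob)
print(len(shards))                  # number of entities that would be written
print(join_shards(shards) == blob)  # the round trip must be lossless
```

The fewer bytes you have after compression, the fewer chunks this produces, which is where the saving in datastore operations comes from.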

*Solution for 1) Serialization:*
cPickle is very slow. It's meant to serialize real
objects and not just JSON. The json lib is much faster,
but compared to marshal it has no chance.
*The Python marshal library is definitely the*
*way to serialize JSON.* It has the best performance.

*Solution for 2) Compression:*
For my use case it makes absolute sense to
compress the data the marshal lib produces
before storing it in the datastore. I have gigabytes
of JSON data. Compressing the data makes
it about 5x smaller. Doing 5x fewer datastore
operations definitely pays for the time it
takes to compress and decompress the data.
There are several compression levels you
can use with Python's zlib,
from 1 (lowest compression, but fastest)
to 9 (highest compression, but slowest).
During my tests I found that the optimum
is to *compress your serialized data using*
*zlib with level 1 compression*. Higher
compression takes too much CPU and
the result is only marginally smaller.
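
A minimal sketch of the marshal-then-zlib combination, assuming Python 3 and made-up sample data; `pack` and `unpack` are hypothetical helper names. It prints the blob size at a few compression levels so you can check the trade-off on your own data.

```python
import marshal
import zlib

def pack(obj, level=1):
    """Serialize with marshal (format version 2), then compress with zlib."""
    return zlib.compress(marshal.dumps(obj, 2), level)

def unpack(blob):
    """Reverse of pack: decompress, then parse."""
    return marshal.loads(zlib.decompress(blob))

# Hypothetical, repetitive sample data; real payloads will compress differently.
data = [{"id": i, "name": "user%d" % i, "tags": ["a", "b", "c"]} for i in range(5000)]

raw_size = len(marshal.dumps(data, 2))
for level in (1, 6, 9):
    blob = pack(data, level)
    assert unpack(blob) == data  # the round trip must be lossless
    print("level %d: %d bytes (raw %d)" % (level, len(blob), raw_size))
```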

Here are my test results:

serializer  ziplvl  dump (s)   load (s)   size (bytes)
cPickle     0       1.671010   0.764567   3297275
cPickle     1       2.033570   0.874783    935327
json        0       0.595903   0.698307   2321719
json        1       0.667103   0.795470    458030
marshal     0       0.118067   0.314645   2311342
marshal     1       0.315362   0.335677    470956
marshal     2       0.318787   0.380117    457196
marshal     3       0.350247   0.364908    446085
marshal     4       0.414658   0.318973    437764
marshal     5       0.448890   0.350013    418712
marshal     6       0.516882   0.367595    409947
marshal     7       0.617210   0.315827    398354
marshal     8       1.117032   0.346452    392332
marshal     9       1.366547   0.368925    391921

The results do not include datastore operations;
they only cover creating a blob that can be stored
in the datastore and getting the parsed data back.
The "dump" and "load" times are the seconds it takes
to do this on a Google App Engine F1 instance
(600 MHz, 128 MB RAM).
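
If you want to run a similar measurement yourself, a harness along these lines would do it. This is a sketch, not the script behind the numbers above: it targets Python 3 (so `pickle` stands in for cPickle), the `measure` helper and the test data are made up, the pickle protocol is a guess, and it treats ziplvl 0 as "no compression", which is an assumption about what level 0 meant in the results.

```python
import json
import marshal
import pickle  # cPickle on Python 2
import time
import zlib

# Each entry maps a name to a (dumps, loads) pair that works on bytes.
SERIALIZERS = {
    "json": (lambda o: json.dumps(o).encode("utf-8"),
             lambda b: json.loads(b.decode("utf-8"))),
    "pickle": (lambda o: pickle.dumps(o, pickle.HIGHEST_PROTOCOL), pickle.loads),
    "marshal": (lambda o: marshal.dumps(o, 2), marshal.loads),
}

def measure(name, obj, ziplvl):
    """Return (dump_seconds, load_seconds, size_bytes) for one configuration."""
    dumps, loads = SERIALIZERS[name]

    start = time.perf_counter()
    blob = dumps(obj)
    if ziplvl:  # assumption: level 0 means the blob is stored uncompressed
        blob = zlib.compress(blob, ziplvl)
    dump_s = time.perf_counter() - start

    start = time.perf_counter()
    raw = zlib.decompress(blob) if ziplvl else blob
    result = loads(raw)
    load_s = time.perf_counter() - start

    assert result == obj  # sanity check: the round trip must be lossless
    return dump_s, load_s, len(blob)

# Hypothetical test data, much smaller than the 4 MB object used for the post.
data = {"items": [{"id": i, "text": "entry %d" % i} for i in range(20000)]}

for name in ("pickle", "json", "marshal"):
    for ziplvl in (0, 1):
        dump_s, load_s, size = measure(name, data, ziplvl)
        print("%s ziplvl: %d  dump: %.6fs  load: %.6fs  size: %d"
              % (name, ziplvl, dump_s, load_s, size))
```

Timings from a desktop machine will not match an F1 instance, so only the relative order is worth comparing.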

I posted this email on my blog:
http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
You can also comment there or on this email thread.

Enjoy,
-Andrin

Here is the library I created and use:

#!/usr/bin/env python
#
# Copyright 2012 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import zlib
import marshal

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1


class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob"""

  def default_value(self):
    return None

  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                                 COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob

  def validate(self, value):
    return value
