Great. How would this look for the ndb package?

On Jun 1, 2012, at 2:40 PM, Andrin von Rechenberg wrote:
> Hey there
>
> If you want to store megabytes of JSON in the datastore and get it
> back from the datastore into Python already parsed, this post is for
> you.
>
> I ran a couple of performance tests where I store a 4 MB JSON object
> in the datastore and then get it back at a later point and process it.
>
> There are several ways to do this.
>
> Challenge 1) Serialization
> You need to serialize your data, and there are several libraries for
> that. JSON objects can be serialized using the json lib, the cPickle
> lib, or the marshal lib (these are the libraries I'm aware of at the
> moment).
>
> Challenge 2) Compression
> If your serialized data doesn't fit into 1 MB, you need to shard it
> over multiple datastore entities and manually put it back together
> when loading the entities. If you compress your serialized data
> before storing it, you pay the cost of compression and decompression,
> but you have to fetch fewer datastore entities when you load your
> data, and you have to write fewer entities when you update your data
> if it is sharded.
>
> Solution for 1) Serialization:
> cPickle is very slow; it is meant to serialize real objects, not just
> JSON. The json lib is much faster, but it has no chance against
> marshal. The Python marshal library is definitely the way to
> serialize JSON: it has the best performance.
>
> Solution for 2) Compression:
> For my use case it makes absolute sense to compress the data the
> marshal lib produces before storing it in the datastore. I have
> gigabytes of JSON data, and compressing it makes it about 5x smaller.
> Doing 5x fewer datastore operations definitely pays for the time it
> takes to compress and decompress the data. There are several
> compression levels you can choose when using Python's zlib, from
> 1 (lowest compression, but fastest) to 9 (highest compression, but
> slowest).
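Taken together, the two solutions above amount to a pair of helper functions: marshal for serialization and zlib level 1 for compression. A minimal stand-alone sketch (the helper names are mine, not from the library further down):

```python
import marshal
import zlib

MARSHAL_VERSION = 2    # marshal format version, matching the library below
COMPRESSION_LEVEL = 1  # zlib level 1: fastest, only marginally larger output

def dump_json_blob(obj):
    """Serialize a JSON-like object with marshal, then compress with zlib."""
    # Caveat: marshal's format is not guaranteed stable across Python
    # versions, so only read blobs back with the same interpreter family.
    return zlib.compress(marshal.dumps(obj, MARSHAL_VERSION), COMPRESSION_LEVEL)

def load_json_blob(blob):
    """Decompress a blob from dump_json_blob and parse it back to Python."""
    return marshal.loads(zlib.decompress(blob))

data = {"users": [{"id": i, "name": "user%d" % i} for i in range(1000)]}
blob = dump_json_blob(data)
assert load_json_blob(blob) == data
```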
> During my tests I figured that the optimum is to compress your
> serialized data using zlib with level 1 compression. Higher
> compression takes too much CPU and the result is only marginally
> smaller.
>
> Here are my test results:
>
> cPickle ziplvl 0:  dump: 1.671010s  load: 0.764567s  size: 3297275
> cPickle ziplvl 1:  dump: 2.033570s  load: 0.874783s  size:  935327
> json    ziplvl 0:  dump: 0.595903s  load: 0.698307s  size: 2321719
> json    ziplvl 1:  dump: 0.667103s  load: 0.795470s  size:  458030
> marshal ziplvl 0:  dump: 0.118067s  load: 0.314645s  size: 2311342
> marshal ziplvl 1:  dump: 0.315362s  load: 0.335677s  size:  470956
> marshal ziplvl 2:  dump: 0.318787s  load: 0.380117s  size:  457196
> marshal ziplvl 3:  dump: 0.350247s  load: 0.364908s  size:  446085
> marshal ziplvl 4:  dump: 0.414658s  load: 0.318973s  size:  437764
> marshal ziplvl 5:  dump: 0.448890s  load: 0.350013s  size:  418712
> marshal ziplvl 6:  dump: 0.516882s  load: 0.367595s  size:  409947
> marshal ziplvl 7:  dump: 0.617210s  load: 0.315827s  size:  398354
> marshal ziplvl 8:  dump: 1.117032s  load: 0.346452s  size:  392332
> marshal ziplvl 9:  dump: 1.366547s  load: 0.368925s  size:  391921
>
> The results do not include datastore operations; they cover only
> creating a blob that can be stored in the datastore and getting the
> parsed data back. The "dump" and "load" times are seconds it takes to
> do this on a Google App Engine F1 instance (600 MHz, 128 MB RAM).
>
> I posted this email on my blog:
> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
> You can also comment there or on this email thread.
>
> Enjoy,
> -Andrin
>
> Here is the library I created and use:
>
> #!/usr/bin/env python
> #
> # Copyright 2012 MiuMeet AG
> #
> # Licensed under the Apache License, Version 2.0 (the "License");
> # you may not use this file except in compliance with the License.
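The numbers above are specific to one 4 MB object on an F1 instance; a rough harness to reproduce the comparison on your own data might look like the sketch below (the `benchmark` helper and sample data are illustrative, not from the original tests, and cPickle is omitted since it only exists on Python 2):

```python
import json
import marshal
import time
import zlib

def benchmark(name, dumps, loads, data, ziplvl=0):
    """Time one dump/load round trip and report the resulting blob size."""
    start = time.time()
    blob = dumps(data)
    if ziplvl:
        blob = zlib.compress(blob, ziplvl)
    dump_secs = time.time() - start

    start = time.time()
    raw = zlib.decompress(blob) if ziplvl else blob
    assert loads(raw) == data  # sanity-check the round trip
    load_secs = time.time() - start

    print("%-7s ziplvl %d: dump %.4fs load %.4fs size %d"
          % (name, ziplvl, dump_secs, load_secs, len(blob)))

# Illustrative sample data; substitute your real JSON object here.
data = {"items": [{"id": i, "tags": ["a", "b"]} for i in range(10000)]}

for lvl in (0, 1):
    benchmark("json", lambda d: json.dumps(d).encode("utf-8"),
              lambda b: json.loads(b.decode("utf-8")), data, lvl)
    benchmark("marshal", lambda d: marshal.dumps(d, 2), marshal.loads, data, lvl)
```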
> # You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
>
> from google.appengine.api import datastore_types
> from google.appengine.ext import db
>
> import zlib
> import marshal
>
> MARSHAL_VERSION = 2
> COMPRESSION_LEVEL = 1
>
> class JsonMarshalZipProperty(db.BlobProperty):
>   """Stores a JSON-serializable object using zlib and marshal in a db.Blob."""
>
>   def default_value(self):
>     return None
>
>   def get_value_for_datastore(self, model_instance):
>     value = self.__get__(model_instance, model_instance.__class__)
>     if value is None:
>       return None
>     return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>                                  COMPRESSION_LEVEL))
>
>   def make_value_from_datastore(self, value):
>     if value is not None:
>       return marshal.loads(zlib.decompress(value))
>     return value
>
>   data_type = datastore_types.Blob
>
>   def validate(self, value):
>     return value
>
> --
> You received this message because you are subscribed to the Google
> Groups "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
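As for the opening question: in ndb, a custom property overrides `_to_base_type` and `_from_base_type` rather than `get_value_for_datastore` / `make_value_from_datastore`. A minimal sketch along those lines (untested here, since it only runs inside the App Engine SDK; note also that ndb's built-in `JsonProperty(compressed=True)` covers a similar use case with json plus zlib instead of marshal):

```python
from google.appengine.ext import ndb

import marshal
import zlib

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class JsonMarshalZipProperty(ndb.BlobProperty):
  """Stores a JSON-serializable object using marshal and zlib, as above."""

  def _to_base_type(self, value):
    # ndb calls this to convert the user-level value before writing.
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                         COMPRESSION_LEVEL)

  def _from_base_type(self, value):
    # ndb calls this to convert the raw blob back when reading.
    return marshal.loads(zlib.decompress(value))
```

Because `ndb.BlobProperty` already handles `None` values itself, the db-style `default_value` and `validate` overrides from the library above should not be needed.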
