The type of data being serialized certainly affects how much faster marshal 
is. When testing with a single string of approximately 1 MB, marshal was 
around 10 times as fast as JSON but only about 10% faster than pickle. With 
a dict of dicts of integers (around 1 MB when serialized with marshal), I 
found that pickle was about 50% faster than JSON, but marshal was around 200 
times faster than pickle!
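
For anyone who wants to try this kind of comparison on their own data, here is a minimal sketch. It assumes Python 3, where cPickle has been folded into pickle. The helper name time_dumps and the two sample payloads are made up for illustration; they are not the data behind the figures above, and the timings will differ by machine and by pickle protocol.

```python
import json
import marshal
import pickle
import timeit


def time_dumps(payload, repeat=3):
    """Return {serializer name: best dump time in seconds} for payload."""
    serializers = {
        "json": json.dumps,
        "pickle": pickle.dumps,
        "marshal": marshal.dumps,
    }
    results = {}
    for name, dumps in serializers.items():
        # timeit.repeat returns one total per run; keep the fastest run.
        runs = timeit.repeat(lambda: dumps(payload), number=1, repeat=repeat)
        results[name] = min(runs)
    return results


# Hypothetical payloads: one large string and one dict of dicts of integers.
big_string = "x" * (1024 * 1024)
dict_of_dicts = {str(i): {str(j): i * j for j in range(50)} for i in range(2000)}

for label, payload in (("string", big_string), ("dict of dicts", dict_of_dicts)):
    timings = time_dumps(payload)
    print(label, {name: round(seconds, 6) for name, seconds in timings.items()})
```

Only dump times are measured here; a load-side comparison would follow the same pattern with the matching loads functions.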

If I run some more empirical tests, I will share some real numbers. What 
sort of data were you using in your tests, Andrin?


On Friday, June 1, 2012 11:40:38 AM UTC-7, Andrin von Rechenberg wrote:
>
> Hey there
>
> If you want to store megabytes of JSON in datastore
> and get it back from datastore into python already parsed, 
> this post is for you.
>
> I ran a couple of performance tests where I want to store
> a 4 MB json object in the datastore and then get it back at
> a later point and process it.
>
> There are several ways to do this.
>
> *Challenge 1) Serialization*
> You need to serialize your data.
> For this you can use several different libraries.
> JSON objects can be serialized using:
> the json lib, the cPickle lib or the marshal lib.
> (these are the libraries I'm aware of atm)
>
> *Challenge 2) Compression*
> If your serialized data doesn't fit into 1mb you need
> to shard your data over multiple datastore entities and
> manually build it together when loading the entities back.
> If you compress your serialized data before storing it,
> you pay the cost of compression and decompression,
> but you have to fetch fewer datastore entities when you
> want to load your data, and you have to write fewer
> datastore entities when you update your data if it is
> sharded.
>
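
The sharding step described above can be sketched without any datastore calls. This is only an illustration: the helper names split_blob and join_blob and the MAX_CHUNK value are assumptions, chosen to stay a little under the 1 MB limit mentioned above. Storing and fetching the chunks as entities is left out.

```python
MAX_CHUNK = 1000 * 1000  # assumed margin below the 1 MB entity limit


def split_blob(blob, chunk_size=MAX_CHUNK):
    """Split a bytes blob into a list of chunks of at most chunk_size bytes."""
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]


def join_blob(chunks):
    """Rebuild the original blob from chunks kept in their original order."""
    return b"".join(chunks)


blob = bytes(2500000)  # stand-in for a serialized payload
chunks = split_blob(blob)
assert join_blob(chunks) == blob
print(len(chunks))  # 3 chunks, so 3 entities for this payload
```

The chunks have to be stored with something that preserves their order (an index field, for example), since join_blob relies on it.
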
> *Solution for 1) Serialization:*
> cPickle is very slow. It's meant to serialize real
> objects and not just json. JSON is much faster,
> but compared to marshal it has no chance.
> *The python marshal library is definitely the*
> *way to serialize JSON.* It has the best performance.
>
> *Solution for 2) Compression:*
> For my use-case it absolutely makes sense to
> compress the data the marshal lib produces
> before storing it in datastore. I have gigabytes
> of JSON data. Compressing the data makes
> it about 5x smaller. Doing 5x fewer datastore
> operations definitely pays for the time it
> takes to compress and decompress the data.
> There are several compression levels you
> can use with python's zlib,
> from 1 (lowest compression, but fastest)
> to 9 (highest compression, but slowest).
> During my tests I figured that the optimum
> is to *compress your serialized data using*
> *zlib with level 1 compression*. Higher
> compression takes too much CPU and
> the result is only marginally smaller.
>
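
To check the size versus CPU trade-off on your own data, something like the following sketch may help. It assumes Python 3; the function name compare_levels and the sample payload are hypothetical and unrelated to the measurements in this thread.

```python
import marshal
import time
import zlib


def compare_levels(obj, levels=range(1, 10)):
    """Return a list of (level, compressed size, seconds) for marshal + zlib."""
    raw = marshal.dumps(obj)
    rows = []
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(raw, level)
        elapsed = time.perf_counter() - start
        # Make sure the round trip really restores the object.
        assert marshal.loads(zlib.decompress(packed)) == obj
        rows.append((level, len(packed), elapsed))
    return rows


# Hypothetical JSON-like payload.
sample = {"user%d" % i: {"score": i, "tags": ["a", "b", "c"]} for i in range(20000)}
for level, size, seconds in compare_levels(sample):
    print("level %d: %d bytes in %.4fs" % (level, size, seconds))
```

How much the higher levels gain depends heavily on how repetitive the data is, so it is worth running this on a representative payload before settling on a level.
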
> Here are my test results:
>
> serializer  ziplvl  dump        load        size
> cPickle     0       1.671010s   0.764567s   3297275
> cPickle     1       2.033570s   0.874783s   935327
> json        0       0.595903s   0.698307s   2321719
> json        1       0.667103s   0.795470s   458030
> marshal     0       0.118067s   0.314645s   2311342
> marshal     1       0.315362s   0.335677s   470956
> marshal     2       0.318787s   0.380117s   457196
> marshal     3       0.350247s   0.364908s   446085
> marshal     4       0.414658s   0.318973s   437764
> marshal     5       0.448890s   0.350013s   418712
> marshal     6       0.516882s   0.367595s   409947
> marshal     7       0.617210s   0.315827s   398354
> marshal     8       1.117032s   0.346452s   392332
> marshal     9       1.366547s   0.368925s   391921
>
> The results do not include datastore operations;
> they only cover creating a blob that can be stored
> in the datastore and getting the parsed data back.
> The "dump" and "load" times are the seconds it takes
> to do this on a Google AppEngine F1 instance
> (600 MHz, 128 MB RAM).
>
> I posted this email on my blog: 
> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
> You can also comment there or on this email thread.
>
> Enjoy,
> -Andrin
>
> Here is the library I created and use:
>
> #!/usr/bin/env python
> #
> # Copyright 2012 MiuMeet AG
> #
> # Licensed under the Apache License, Version 2.0 (the "License");
> # you may not use this file except in compliance with the License.
> # You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> from google.appengine.api import datastore_types
> from google.appengine.ext import db
>
> import zlib
> import marshal
>
> MARSHAL_VERSION = 2
> COMPRESSION_LEVEL = 1
>
> class JsonMarshalZipProperty(db.BlobProperty):
>   """Stores a JSON serializable object using zlib and marshal in a db.Blob"""
>
>   def default_value(self):
>     return None
>   
>   def get_value_for_datastore(self, model_instance):
>     value = self.__get__(model_instance, model_instance.__class__)
>     if value is None:
>       return None
>     return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>                                  COMPRESSION_LEVEL))
>
>   def make_value_from_datastore(self, value):
>     if value is not None:
>       return marshal.loads(zlib.decompress(value))
>     return value
>
>   data_type = datastore_types.Blob
>   
>   def validate(self, value):
>     return value
>
>
>
>
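
Outside of App Engine, the core of the property quoted above (marshal first, then zlib at level 1) can be tried with two small functions. The names pack and unpack are made up for this sketch, and it assumes Python 3; the two constants mirror the ones in the quoted library. Keep in mind that the marshal format is tied to the Python implementation and that marshal.loads is not meant for untrusted input.

```python
import marshal
import zlib

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1


def pack(value):
    """Serialize with marshal, then compress with zlib; None passes through."""
    if value is None:
        return None
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION), COMPRESSION_LEVEL)


def unpack(blob):
    """Reverse of pack: decompress, then load with marshal."""
    if blob is None:
        return None
    return marshal.loads(zlib.decompress(blob))


data = {"name": "example", "values": [1, 2, 3], "nested": {"ok": True}}
assert unpack(pack(data)) == data
```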
