The type of data being serialized certainly affects how much faster marshal is. When testing with just a string of approximately 1 MB, marshal was around 10 times as fast as JSON but only about 10% faster than pickle. With a dict of dicts of integers (around 1 MB when serialized with marshal), I found that pickle was about 50% faster than JSON, but marshal was around 200 times faster than pickle!
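For anyone who wants to try the same kind of comparison locally, here is a minimal sketch of how it could be timed. It is not the script I used for the numbers above; the helper name `time_roundtrip`, the `SERIALIZERS` table and the sample data are made up for illustration. It uses `pickle` from Python 3 where the thread talks about `cPickle`, and string keys so that JSON round-trips the dict unchanged.

```python
import json
import marshal
import pickle
import time

# Hypothetical table of (dumps, loads) pairs; JSON is encoded to bytes so
# that sizes are comparable across the three serializers.
SERIALIZERS = {
    "json": (lambda d: json.dumps(d).encode("utf-8"),
             lambda b: json.loads(b.decode("utf-8"))),
    "pickle": (lambda d: pickle.dumps(d, pickle.HIGHEST_PROTOCOL),
               pickle.loads),
    "marshal": (marshal.dumps, marshal.loads),
}


def time_roundtrip(name, dumps, loads, data, repeats=3):
    """Return (name, best dump seconds, best load seconds, size in bytes)."""
    best_dump = best_load = float("inf")
    blob = b""
    for _ in range(repeats):
        start = time.perf_counter()
        blob = dumps(data)
        best_dump = min(best_dump, time.perf_counter() - start)

        start = time.perf_counter()
        restored = loads(blob)
        best_load = min(best_load, time.perf_counter() - start)

        # Sanity check: the round trip must give back the same data.
        assert restored == data
    return name, best_dump, best_load, len(blob)


if __name__ == "__main__":
    samples = {
        # Assumed stand-ins for the two cases described above.
        "1 MB string": "x" * (1024 * 1024),
        "dict of dicts of ints": {
            str(i): {str(j): i * j for j in range(50)} for i in range(2000)
        },
    }
    for label, data in samples.items():
        print(label)
        for name, (dumps, loads) in SERIALIZERS.items():
            _, dump_s, load_s, size = time_roundtrip(name, dumps, loads, data)
            print("  %-8s dump: %.6fs  load: %.6fs  size: %d"
                  % (name, dump_s, load_s, size))
```

The ratios will depend on the Python version and the machine, so treat whatever it prints as a rough indication only.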
If I run some more empirical tests I will share some real numbers. What sort of data were you using in your tests, Andrin?

On Friday, June 1, 2012 11:40:38 AM UTC-7, Andrin von Rechenberg wrote:
>
> Hey there
>
> If you want to store megabytes of JSON in datastore
> and get it back from datastore into python already parsed,
> this post is for you.
>
> I ran a couple of performance tests where I want to store
> a 4 MB json object in the datastore and then get it back at
> a later point and process it.
>
> There are several ways to do this.
>
> *Challenge 1) Serialization*
> You need to serialize your data.
> For this you can use several different libraries.
> JSON objects can be serialized using:
> the json lib, the cPickle lib or the marshal lib.
> (these are the libraries I'm aware of atm)
>
> *Challenge 2) Compression*
> If your serialized data doesn't fit into 1mb you need
> to shard your data over multiple datastore entities and
> manually build it together when loading the entities back.
> If you compress your serialized data and store it then,
> you have the cost of compression and decompression,
> but you have to fetch fewer datastore entities when you
> want to load your data and you have to write fewer
> datastore entities if you want to update your data if it
> is sharded.
>
> *Solution for 1) Serialization:*
> cPickle is very slow. It's meant to serialize real
> objects and not just json. JSON is much faster,
> but compared to marshal it has no chance.
> *The python marshal library is definitely the*
> *way to serialize JSON.* It has the best performance.
>
> *Solution for 2) Compression:*
> For my use-case it makes absolute sense to
> compress the data the marshal lib produces
> before storing it in datastore. I have gigabytes
> of JSON data. Compressing the data makes
> it about 5x smaller. Doing 5x fewer datastore
> operations definitely pays for the time it
> takes to compress and decompress the data.
> There are several compression levels you
> can use when using python's zlib.
> From 1 (lowest compression, but fastest)
> to 9 (highest compression but slowest).
> During my tests I figured that the optimum
> is to *compress your serialized data using*
> *zlib with level 1 compression*. Higher
> compression takes too much CPU and
> the result is only marginally smaller.
>
> Here are my test results:
>
> *cPickle ziplvl: 0*
>
> > dump: 1.671010s
> > load: 0.764567s
> > size: 3297275
>
> *cPickle ziplvl: 1*
>
> > dump: 2.033570s
> > load: 0.874783s
> > size: 935327
>
> *json ziplvl: 0*
>
> > dump: 0.595903s
> > load: 0.698307s
> > size: 2321719
>
> *json ziplvl: 1*
>
> > dump: 0.667103s
> > load: 0.795470s
> > size: 458030
>
> *marshal ziplvl: 0*
>
> > dump: 0.118067s
> > load: 0.314645s
> > size: 2311342
>
> *marshal ziplvl: 1*
>
> > dump: 0.315362s
> > load: 0.335677s
> > size: 470956
>
> *marshal ziplvl: 2*
>
> > dump: 0.318787s
> > load: 0.380117s
> > size: 457196
>
> *marshal ziplvl: 3*
>
> > dump: 0.350247s
> > load: 0.364908s
> > size: 446085
>
> *marshal ziplvl: 4*
>
> > dump: 0.414658s
> > load: 0.318973s
> > size: 437764
>
> *marshal ziplvl: 5*
>
> > dump: 0.448890s
> > load: 0.350013s
> > size: 418712
>
> *marshal ziplvl: 6*
>
> > dump: 0.516882s
> > load: 0.367595s
> > size: 409947
>
> *marshal ziplvl: 7*
>
> > dump: 0.617210s
> > load: 0.315827s
> > size: 398354
>
> *marshal ziplvl: 8*
>
> > dump: 1.117032s
> > load: 0.346452s
> > size: 392332
>
> *marshal ziplvl: 9*
>
> > dump: 1.366547s
> > load: 0.368925s
> > size: 391921
>
> The results do not include datastore operations,
> it's just about creating a blob that can be stored
> in the datastore and getting the parsed data back.
> The times of "dump" and "load" are the seconds it takes
> to do this on a Google AppEngine F1 instance
> (600MHz, 128MB RAM).
>
> I posted this email on my blog:
> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
> You can also comment there or on this email thread.
>
> Enjoy,
> -Andrin
>
> Here is the library I created and use:
>
> #!/usr/bin/env python
> #
> # Copyright 2012 MiuMeet AG
> #
> # Licensed under the Apache License, Version 2.0 (the "License");
> # you may not use this file except in compliance with the License.
> # You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> from google.appengine.api import datastore_types
> from google.appengine.ext import db
>
> import zlib
> import marshal
>
> MARSHAL_VERSION = 2
> COMPRESSION_LEVEL = 1
>
> class JsonMarshalZipProperty(db.BlobProperty):
>     """Stores a JSON serializable object using zlib and marshal in a db.Blob"""
>
>     def default_value(self):
>         return None
>
>     def get_value_for_datastore(self, model_instance):
>         value = self.__get__(model_instance, model_instance.__class__)
>         if value is None:
>             return None
>         return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>                                      COMPRESSION_LEVEL))
>
>     def make_value_from_datastore(self, value):
>         if value is not None:
>             return marshal.loads(zlib.decompress(value))
>         return value
>
>     data_type = datastore_types.Blob
>
>     def validate(self, value):
>         return value
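The zlib level comparison in the quoted results can also be tried outside App Engine. Below is a rough sketch under a few assumptions: the names `dump`, `load` and `sweep` are mine and not part of Andrin's library, "ziplvl 0" is taken to mean "no zlib step at all" (the post does not say how level 0 was measured), and the sample data is invented.

```python
import marshal
import time
import zlib


def dump(value, level):
    """Serialize with marshal, then zlib-compress unless level is 0."""
    raw = marshal.dumps(value)
    # Assumption: level 0 means skipping zlib entirely.
    return zlib.compress(raw, level) if level else raw


def load(blob, level):
    """Inverse of dump(): decompress if needed, then unmarshal."""
    raw = zlib.decompress(blob) if level else blob
    return marshal.loads(raw)


def sweep(value, levels=range(10)):
    """Return a list of (level, dump seconds, load seconds, blob size)."""
    rows = []
    for level in levels:
        start = time.perf_counter()
        blob = dump(value, level)
        dump_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = load(blob, level)
        load_s = time.perf_counter() - start

        # Sanity check: the stored blob must decode to the original value.
        assert restored == value
        rows.append((level, dump_s, load_s, len(blob)))
    return rows


if __name__ == "__main__":
    # Invented JSON-like sample; replace with your own data.
    sample = {
        str(i): {"score": i, "tags": ["a", "b", "c"], "name": "user%d" % i}
        for i in range(50000)
    }
    for level, dump_s, load_s, size in sweep(sample):
        print("marshal ziplvl: %d  dump: %.6fs  load: %.6fs  size: %d"
              % (level, dump_s, load_s, size))
```

How much each level gains will depend heavily on how repetitive your data is, so it is worth running the sweep on a representative sample before settling on a level.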
