Hey there! If you want to store megabytes of JSON in the datastore and get it back into Python already parsed, this post is for you.
I ran a couple of performance tests where I store a 4 MB JSON object in the datastore and then get it back at a later point and process it. There are several ways to do this.

*Challenge 1) Serialization*

You need to serialize your data. For this you can use several different libraries. JSON objects can be serialized using the json lib, the cPickle lib, or the marshal lib. (These are the libraries I'm aware of at the moment.)

*Challenge 2) Compression*

If your serialized data doesn't fit into 1 MB, you need to shard it over multiple datastore entities and manually assemble it when loading the entities back. If you compress your serialized data before storing it, you pay the cost of compression and decompression, but you have to fetch fewer datastore entities when you load your data, and you have to write fewer entities when you update it if it is sharded.

*Solution for 1) Serialization:*

cPickle is very slow. It's meant to serialize real objects, not just JSON. json is much faster, but compared to marshal it has no chance. *The Python marshal library is definitely the way to serialize JSON.* It has the best performance.

*Solution for 2) Compression:*

For my use case it makes absolute sense to compress the data the marshal lib produces before storing it in the datastore. I have gigabytes of JSON data, and compressing it makes it about 5x smaller. Doing 5x fewer datastore operations definitely pays for the time it takes to compress and decompress the data. Python's zlib offers several compression levels, from 1 (lowest compression, but fastest) to 9 (highest compression, but slowest). During my tests I found that the optimum is to *compress your serialized data using zlib with level 1 compression*. Higher compression takes too much CPU and the result is only marginally smaller.
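To make the two steps concrete, here is a minimal sketch (plain Python, no datastore involved) of the marshal + zlib level 1 combination. One caveat worth a comment: the marshal format is Python-version specific, so a blob written by one runtime should be read back by the same runtime.

```python
import marshal
import zlib

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1  # level 1: fastest; higher levels barely shrink the blob

def dumps(obj):
    """Serialize a JSON-compatible object with marshal, then compress it.

    Note: marshal output is Python-version specific, so only read the
    blob back with the same Python runtime that wrote it.
    """
    return zlib.compress(marshal.dumps(obj, MARSHAL_VERSION), COMPRESSION_LEVEL)

def loads(blob):
    """Decompress and deserialize a blob produced by dumps()."""
    return marshal.loads(zlib.decompress(blob))

data = {"users": [{"id": i, "name": "user%d" % i} for i in range(1000)]}
blob = dumps(data)
assert loads(blob) == data
```

The same `dumps`/`loads` pair is what the datastore property at the end of this post wraps into `get_value_for_datastore` and `make_value_from_datastore`.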
Here are my test results:

*cPickle ziplvl: 0* dump: 1.671010s load: 0.764567s size: 3297275
*cPickle ziplvl: 1* dump: 2.033570s load: 0.874783s size: 935327
*json ziplvl: 0* dump: 0.595903s load: 0.698307s size: 2321719
*json ziplvl: 1* dump: 0.667103s load: 0.795470s size: 458030
*marshal ziplvl: 0* dump: 0.118067s load: 0.314645s size: 2311342
*marshal ziplvl: 1* dump: 0.315362s load: 0.335677s size: 470956
*marshal ziplvl: 2* dump: 0.318787s load: 0.380117s size: 457196
*marshal ziplvl: 3* dump: 0.350247s load: 0.364908s size: 446085
*marshal ziplvl: 4* dump: 0.414658s load: 0.318973s size: 437764
*marshal ziplvl: 5* dump: 0.448890s load: 0.350013s size: 418712
*marshal ziplvl: 6* dump: 0.516882s load: 0.367595s size: 409947
*marshal ziplvl: 7* dump: 0.617210s load: 0.315827s size: 398354
*marshal ziplvl: 8* dump: 1.117032s load: 0.346452s size: 392332
*marshal ziplvl: 9* dump: 1.366547s load: 0.368925s size: 391921

The results do not include datastore operations; it's just about creating a blob that can be stored in the datastore and getting the parsed data back. The "dump" and "load" times are the seconds it takes to do this on a Google App Engine F1 instance (600 MHz, 128 MB RAM).

I posted this email on my blog: http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html You can also comment there or on this email thread.
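For reference, the numbers above came from a loop of roughly this shape. This is a reconstruction sketch, not the exact script: the payload here is a small stand-in for my 4 MB test object, and `benchmark` is a name I'm using for illustration.

```python
import json
import marshal
import time
import zlib

def benchmark(name, dump, load, data, ziplevel):
    """Time one dump/load round trip; ziplevel 0 means no compression."""
    start = time.time()
    blob = dump(data)
    if ziplevel > 0:
        blob = zlib.compress(blob, ziplevel)
    dump_time = time.time() - start

    start = time.time()
    raw = zlib.decompress(blob) if ziplevel > 0 else blob
    load(raw)
    load_time = time.time() - start

    print("%s ziplvl: %d dump: %fs load: %fs size: %d"
          % (name, ziplevel, dump_time, load_time, len(blob)))
    return dump_time, load_time, len(blob)

# Small stand-in payload; the real test object was about 4 MB of JSON.
data = {"rows": [{"id": i, "text": "x" * 50} for i in range(10000)]}
for level in (0, 1, 9):
    benchmark("json", lambda d: json.dumps(d).encode("utf-8"),
              lambda b: json.loads(b.decode("utf-8")), data, level)
    benchmark("marshal", lambda d: marshal.dumps(d, 2),
              marshal.loads, data, level)
```

Absolute times will differ from the table (an F1 instance is much slower than a workstation), but the relative ordering of the serializers and compression levels should hold.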
Enjoy,
-Andrin

Here is the library I created and use:

#!/usr/bin/env python
#
# Copyright 2012 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import zlib
import marshal

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1


class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob."""

  def default_value(self):
    return None

  def get_value_for_datastore(self, model_instance):
    # Serialize with marshal, compress with zlib, and wrap in a db.Blob.
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(
        zlib.compress(marshal.dumps(value, MARSHAL_VERSION), COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    # Reverse of the above: decompress, then deserialize.
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob

  def validate(self, value):
    return value

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
