Great. How would this look for the ndb package?

On Jun 1, 2012, at 2:40 PM, Andrin von Rechenberg wrote:
> Hey there
>
> If you want to store megabytes of JSON in the datastore and get it
> back from the datastore into Python already parsed, this post is for
> you.
>
> I ran a couple of performance tests where I store a 4 MB JSON object
> in the datastore and then get it back at a later point and process it.
>
> There are several ways to do this.
>
> Challenge 1) Serialization
> You need to serialize your data, and there are several libraries for
> that. JSON objects can be serialized using the json lib, the cPickle
> lib, or the marshal lib (these are the libraries I'm aware of at the
> moment).
>
> Challenge 2) Compression
> If your serialized data doesn't fit into 1 MB, you need to shard it
> over multiple datastore entities and manually put it back together
> when loading the entities. If you compress your serialized data
> before storing it, you pay the cost of compression and decompression,
> but you have to fetch fewer datastore entities when you load your
> data, and you have to write fewer entities when you update your data
> if it is sharded.
>
> Solution for 1) Serialization:
> cPickle is very slow; it is meant to serialize real objects, not just
> JSON. The json lib is much faster, but it has no chance against
> marshal. The Python marshal library is definitely the way to
> serialize JSON: it has the best performance.
>
> Solution for 2) Compression:
> For my use case it makes absolute sense to compress the data the
> marshal lib produces before storing it in the datastore. I have
> gigabytes of JSON data, and compressing it makes it about 5x smaller.
> Doing 5x fewer datastore operations definitely pays for the time it
> takes to compress and decompress the data. There are several
> compression levels you can choose when using Python's zlib, from
> 1 (lowest compression, but fastest) to 9 (highest compression, but
> slowest).
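Taken together, the two solutions above amount to a pair of helper functions: marshal for serialization and zlib level 1 for compression. A minimal stand-alone sketch (the helper names are mine, not from the library further down):

```python
import marshal
import zlib

MARSHAL_VERSION = 2    # marshal format version, matching the library below
COMPRESSION_LEVEL = 1  # zlib level 1: fastest, only marginally larger output

def dump_json_blob(obj):
    """Serialize a JSON-like object with marshal, then compress with zlib."""
    # Caveat: marshal's format is not guaranteed stable across Python
    # versions, so only read blobs back with the same interpreter family.
    return zlib.compress(marshal.dumps(obj, MARSHAL_VERSION), COMPRESSION_LEVEL)

def load_json_blob(blob):
    """Decompress a blob from dump_json_blob and parse it back to Python."""
    return marshal.loads(zlib.decompress(blob))

data = {"users": [{"id": i, "name": "user%d" % i} for i in range(1000)]}
blob = dump_json_blob(data)
assert load_json_blob(blob) == data
```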
> During my tests I figured that the optimum is to compress your
> serialized data using zlib with level 1 compression. Higher
> compression takes too much CPU and the result is only marginally
> smaller.
>
> Here are my test results:
>
> cPickle ziplvl 0:  dump: 1.671010s  load: 0.764567s  size: 3297275
> cPickle ziplvl 1:  dump: 2.033570s  load: 0.874783s  size:  935327
> json    ziplvl 0:  dump: 0.595903s  load: 0.698307s  size: 2321719
> json    ziplvl 1:  dump: 0.667103s  load: 0.795470s  size:  458030
> marshal ziplvl 0:  dump: 0.118067s  load: 0.314645s  size: 2311342
> marshal ziplvl 1:  dump: 0.315362s  load: 0.335677s  size:  470956
> marshal ziplvl 2:  dump: 0.318787s  load: 0.380117s  size:  457196
> marshal ziplvl 3:  dump: 0.350247s  load: 0.364908s  size:  446085
> marshal ziplvl 4:  dump: 0.414658s  load: 0.318973s  size:  437764
> marshal ziplvl 5:  dump: 0.448890s  load: 0.350013s  size:  418712
> marshal ziplvl 6:  dump: 0.516882s  load: 0.367595s  size:  409947
> marshal ziplvl 7:  dump: 0.617210s  load: 0.315827s  size:  398354
> marshal ziplvl 8:  dump: 1.117032s  load: 0.346452s  size:  392332
> marshal ziplvl 9:  dump: 1.366547s  load: 0.368925s  size:  391921
>
> The results do not include datastore operations; they cover only
> creating a blob that can be stored in the datastore and getting the
> parsed data back. The "dump" and "load" times are seconds it takes to
> do this on a Google App Engine F1 instance (600 MHz, 128 MB RAM).
>
> I posted this email on my blog:
> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
> You can also comment there or on this email thread.
>
> Enjoy,
> -Andrin
>
> Here is the library I created and use:
>
> #!/usr/bin/env python
> #
> # Copyright 2012 MiuMeet AG
> #
> # Licensed under the Apache License, Version 2.0 (the "License");
> # you may not use this file except in compliance with the License.
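The numbers above are specific to one 4 MB object on an F1 instance; a rough harness to reproduce the comparison on your own data might look like the sketch below (the `benchmark` helper and sample data are illustrative, not from the original tests, and cPickle is omitted since it only exists on Python 2):

```python
import json
import marshal
import time
import zlib

def benchmark(name, dumps, loads, data, ziplvl=0):
    """Time one dump/load round trip and report the resulting blob size."""
    start = time.time()
    blob = dumps(data)
    if ziplvl:
        blob = zlib.compress(blob, ziplvl)
    dump_secs = time.time() - start

    start = time.time()
    raw = zlib.decompress(blob) if ziplvl else blob
    assert loads(raw) == data  # sanity-check the round trip
    load_secs = time.time() - start

    print("%-7s ziplvl %d: dump %.4fs load %.4fs size %d"
          % (name, ziplvl, dump_secs, load_secs, len(blob)))

# Illustrative sample data; substitute your real JSON object here.
data = {"items": [{"id": i, "tags": ["a", "b"]} for i in range(10000)]}

for lvl in (0, 1):
    benchmark("json", lambda d: json.dumps(d).encode("utf-8"),
              lambda b: json.loads(b.decode("utf-8")), data, lvl)
    benchmark("marshal", lambda d: marshal.dumps(d, 2), marshal.loads, data, lvl)
```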
> # You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
>
> from google.appengine.api import datastore_types
> from google.appengine.ext import db
>
> import zlib
> import marshal
>
> MARSHAL_VERSION = 2
> COMPRESSION_LEVEL = 1
>
> class JsonMarshalZipProperty(db.BlobProperty):
>   """Stores a JSON-serializable object using zlib and marshal in a db.Blob."""
>
>   def default_value(self):
>     return None
>
>   def get_value_for_datastore(self, model_instance):
>     value = self.__get__(model_instance, model_instance.__class__)
>     if value is None:
>       return None
>     return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>                                  COMPRESSION_LEVEL))
>
>   def make_value_from_datastore(self, value):
>     if value is not None:
>       return marshal.loads(zlib.decompress(value))
>     return value
>
>   data_type = datastore_types.Blob
>
>   def validate(self, value):
>     return value
>
> --
> You received this message because you are subscribed to the Google
> Groups "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
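As for the opening question: in ndb, a custom property overrides `_to_base_type` and `_from_base_type` rather than `get_value_for_datastore` / `make_value_from_datastore`. A minimal sketch along those lines (untested here, since it only runs inside the App Engine SDK; note also that ndb's built-in `JsonProperty(compressed=True)` covers a similar use case with json plus zlib instead of marshal):

```python
from google.appengine.ext import ndb

import marshal
import zlib

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class JsonMarshalZipProperty(ndb.BlobProperty):
  """Stores a JSON-serializable object using marshal and zlib, as above."""

  def _to_base_type(self, value):
    # ndb calls this to convert the user-level value before writing.
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                         COMPRESSION_LEVEL)

  def _from_base_type(self, value):
    # ndb calls this to convert the raw blob back when reading.
    return marshal.loads(zlib.decompress(value))
```

Because `ndb.BlobProperty` already handles `None` values itself, the db-style `default_value` and `validate` overrides from the library above should not be needed.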
