aschmid: The ndb BlobProperty has optional compression built in (see
ndb.model.BlobProperty). You could implement the MarshalProperty like this:
class MarshalProperty(ndb.BlobProperty):

  def _to_base_type(self, value):
    return marshal.dumps(value, MARSHAL_VERSION)

  def _from_base_type(self, value):
    return marshal.loads(value)
Then when you instantiate a property instance you would specify the
compressed option to enable compression:
prop = MarshalProperty(compressed=True)
The compressed option in BlobProperty is implemented so that you can turn it
on and off, and old values will still be read properly: the _from_base_type()
method in BlobProperty only decompresses a stored value if it actually was
compressed.
The BlobProperty uses the default compression level (and does not expose an
option to change it), so if you want to use level 1 (as Andrin recommends)
you would need to implement that in your own subclass.
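One way to get level-1 compression is to do the zlib call yourself in the subclass and skip the compressed=True option (otherwise the value would be compressed twice). A minimal sketch of the round-trip, using illustrative helper names (to_blob/from_blob); an ndb subclass would call these from _to_base_type()/_from_base_type():

```python
import marshal
import zlib

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1  # level 1: fastest; higher levels barely shrink the blob

def to_blob(value):
    # serialize with marshal, then compress with zlib at level 1
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION), COMPRESSION_LEVEL)

def from_blob(blob):
    # reverse the pipeline: decompress, then unmarshal
    return marshal.loads(zlib.decompress(blob))
```

Since the compression is done inside the property subclass, leave compressed=False on the BlobProperty itself.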
On Monday, June 4, 2012 7:41:14 AM UTC-7, aschmid wrote:
>
> is this a valid implementation?
>
> class JsonMarshalZipProperty(ndb.BlobProperty):
>
>   def _to_base_type(self, value):
>     return zlib.compress(marshal.dumps(value, MARSHAL_VERSION))
>
>   def _from_base_type(self, value):
>     return marshal.loads(zlib.decompress(value))
>
>
>
> On Jun 4, 2012, at 9:49 AM, Andreas wrote:
>
> great. how would this look for the ndb package?
>
> On Jun 1, 2012, at 2:40 PM, Andrin von Rechenberg wrote:
>
> Hey there
>
> If you want to store megabytes of JSON in datastore
> and get it back from datastore into python already parsed,
> this post is for you.
>
> I ran a couple of performance tests where I store
> a 4 MB JSON object in the datastore and then get it back at
> a later point and process it.
>
> There are several ways to do this.
>
> *Challenge 1) Serialization*
> You need to serialize your data.
> For this you can use several different libraries.
> JSON objects can be serialized using:
> the json lib, the cPickle lib or the marshal lib.
> (these are the libraries I'm aware of atm)
>
> *Challenge 2) Compression*
> If your serialized data doesn't fit into 1 MB you need
> to shard it over multiple datastore entities and
> manually reassemble it when loading the entities back.
> If you compress your serialized data before storing it,
> you pay the cost of compression and decompression,
> but you have to fetch fewer datastore entities when you
> want to load your data, and you have to write fewer
> datastore entities when you update your data if it
> is sharded.
>
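The sharding step described above can be sketched as follows (the function names and chunk size are illustrative; real code would also need a key scheme and a transaction to keep the chunks of one object consistent):

```python
MAX_CHUNK = 1000000  # stay safely under the ~1 MB entity size limit (approximate)

def shard_blob(blob, chunk_size=MAX_CHUNK):
    # split a (compressed) byte string into datastore-sized chunks
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]

def unshard_blob(chunks):
    # reassemble the chunks fetched back from the datastore, in order
    return b"".join(chunks)
```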
> *Solution for 1) Serialization:*
> cPickle is very slow; it is meant to serialize real
> objects, not just JSON-like data. The json lib is much faster,
> but compared to marshal it has no chance.
> *The python marshal library is definitely the
> way to serialize JSON.* It has the best performance.
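The serialization comparison can be reproduced with a small round-trip timing sketch (the sample payload is illustrative; on Python 3, cPickle is simply the pickle module):

```python
import json
import marshal
import pickle
import time

# a JSON-like sample payload (illustrative)
obj = {"users": [{"id": i, "name": "user%d" % i} for i in range(20000)]}

def bench(name, dumps, loads):
    t0 = time.time()
    blob = dumps(obj)
    t1 = time.time()
    back = loads(blob)
    t2 = time.time()
    assert back == obj  # round-trip must be lossless
    print("%-8s dump %.3fs  load %.3fs  size %d" % (name, t1 - t0, t2 - t1, len(blob)))

bench("json", lambda o: json.dumps(o).encode("utf-8"), lambda b: json.loads(b))
bench("pickle", lambda o: pickle.dumps(o, 2), pickle.loads)
bench("marshal", lambda o: marshal.dumps(o, 2), marshal.loads)
```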
> *Solution for 2) Compression:*
> For my use-case it makes perfect sense to
> compress the data the marshal lib produces
> before storing it in datastore. I have gigabytes
> of JSON data. Compressing the data makes
> it about 5x smaller. Doing 5x fewer datastore
> operations definitely pays for the time it
> takes to compress and decompress the data.
> There are several compression levels you
> can use with python's zlib,
> from 1 (lowest compression, but fastest)
> to 9 (highest compression, but slowest).
> During my tests I found that the optimum
> is to *compress your serialized data using
> zlib with level 1 compression*. Higher
> compression takes too much CPU and
> the result is only marginally smaller.
>
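The level trade-off is easy to check yourself with a short sketch (the payload is illustrative):

```python
import marshal
import time
import zlib

# illustrative payload, marshal-serialized first as in the pipeline above
payload = marshal.dumps({"k%d" % i: list(range(50)) for i in range(20000)}, 2)

for level in (1, 6, 9):
    t0 = time.time()
    blob = zlib.compress(payload, level)
    print("level %d: %.3fs, %d bytes" % (level, time.time() - t0, len(blob)))
```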
> Here are my test results:
>
>   serializer  ziplvl  dump (s)   load (s)   size (bytes)
>   cPickle     0       1.671010   0.764567   3297275
>   cPickle     1       2.033570   0.874783    935327
>   json        0       0.595903   0.698307   2321719
>   json        1       0.667103   0.795470    458030
>   marshal     0       0.118067   0.314645   2311342
>   marshal     1       0.315362   0.335677    470956
>   marshal     2       0.318787   0.380117    457196
>   marshal     3       0.350247   0.364908    446085
>   marshal     4       0.414658   0.318973    437764
>   marshal     5       0.448890   0.350013    418712
>   marshal     6       0.516882   0.367595    409947
>   marshal     7       0.617210   0.315827    398354
>   marshal     8       1.117032   0.346452    392332
>   marshal     9       1.366547   0.368925    391921
>
> The results do not include datastore operations;
> they are just about creating a blob that can be stored
> in the datastore and getting the parsed data back.
> The "dump" and "load" times are the seconds it takes
> to do this on a Google AppEngine F1 instance
> (600 MHz, 128 MB RAM).
>
> I posted this email on my blog:
> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
> You can also comment there or on this email thread.
>
> Enjoy,
> -Andrin
>
> Here is the library I created and use:
>
> #!/usr/bin/env python
> #
> # Copyright 2012 MiuMeet AG
> #
> # Licensed under the Apache License, Version 2.0 (the "License");
> # you may not use this file except in compliance with the License.
> # You may obtain a copy of the License at
> #
> #     http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> from google.appengine.api import datastore_types
> from google.appengine.ext import db
>
> import zlib
> import marshal
>
> MARSHAL_VERSION = 2
> COMPRESSION_LEVEL = 1
>
> class JsonMarshalZipProperty(db.BlobProperty):
>   """Stores a JSON serializable object using zlib and marshal in a db.Blob."""
>
>   def default_value(self):
>     return None
>
>   def get_value_for_datastore(self, model_instance):
>     value = self.__get__(model_instance, model_instance.__class__)
>     if value is None:
>       return None
>     return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>                                  COMPRESSION_LEVEL))
>
>   def make_value_from_datastore(self, value):
>     if value is not None:
>       return marshal.loads(zlib.decompress(value))
>     return value
>
>   data_type = datastore_types.Blob
>
>   def validate(self, value):
>     return value
>
>
>
>
--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To view this discussion on the web visit
https://groups.google.com/d/msg/google-appengine/-/qKSg7YkFW5YJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.