Marshal also has versioning. I hardcoded version 2 and am hoping that it
will be forward compatible.
If it's not, then oh well, I'll need to get the data out first and re-encode it,
but that's a small burden compared to the speed I get.
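
A minimal sketch of what pinning the format version and re-encoding later could look like. The helper names dump_pinned and reencode are hypothetical and only for illustration; marshal.dumps(value, version) and marshal.version are the standard library calls.

```python
import marshal

MARSHAL_VERSION = 2  # pinned format version, same constant as in the library quoted below


def dump_pinned(value, version=MARSHAL_VERSION):
    """Serialize with an explicit marshal format version instead of the default."""
    return marshal.dumps(value, version)


def reencode(blob, new_version=marshal.version):
    """Hypothetical migration helper: decode an old blob and dump it again.

    marshal.version is the newest format the running interpreter writes.
    """
    return marshal.dumps(marshal.loads(blob), new_version)


record = ["abc", "defgh", {"k": 1, "name": "value"}]
old_blob = dump_pinned(record)
new_blob = reencode(old_blob)
assert marshal.loads(old_blob) == marshal.loads(new_blob) == record
```

This only helps as long as a newer interpreter can still read the old format, which is exactly the forward compatibility being hoped for here.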
I was using Python 2.7 with cPickle.
I don't think anything can beat marshal. Marshal is used by Python
internally when serializing data structures in compiled code (the .pyc files).
So it's critical for Python's performance, and if there were something faster
than marshal, Python would definitely use it for exactly this case.
The test data was of this form:

[
  [ ~8 bytes string,
    ~10 bytes string,
    { about 6 key value pairs with up to 200 bytes }
  ],
  ...
]
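
A rough sketch of how data of this shape could be generated and pushed through the three serializers. This is not the original test script: make_records and time_round_trip are hypothetical names, the key names and value lengths are assumptions, and it is written for Python 3, where the pickle module plays the role cPickle had on 2.7.

```python
import json
import marshal
import pickle  # Python 3 name; on Python 2.7 this role was played by cPickle
import random
import string
import time


def random_text(length, rng):
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


def make_records(count, seed=0):
    """Build [[8-char str, 10-char str, {6 key-value pairs}], ...] (assumed sizes)."""
    rng = random.Random(seed)
    return [
        [random_text(8, rng),
         random_text(10, rng),
         {"k%d" % i: random_text(rng.randint(1, 30), rng) for i in range(6)}]
        for _ in range(count)
    ]


def time_round_trip(name, dumps, loads, data):
    """Return (name, dump seconds, load seconds, blob size) for one serializer."""
    start = time.time()
    blob = dumps(data)
    dump_seconds = time.time() - start
    start = time.time()
    restored = loads(blob)
    load_seconds = time.time() - start
    assert restored == data  # the round trip must give the same structure back
    return name, dump_seconds, load_seconds, len(blob)


data = make_records(1000)
codecs = [
    ("json", lambda d: json.dumps(d).encode("utf-8"),
     lambda b: json.loads(b.decode("utf-8"))),
    ("pickle", pickle.dumps, pickle.loads),
    ("marshal", lambda d: marshal.dumps(d, 2), marshal.loads),
]
for name, dumps, loads in codecs:
    print("%-8s dump: %.6fs load: %.6fs size: %d"
          % time_round_trip(name, dumps, loads, data))
```

Timings from a run like this depend heavily on the interpreter version and the machine, so they are not comparable with the F1 numbers quoted below.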
Cheers,
-Andrin
On Sat, Jun 2, 2012 at 3:43 PM, Bryce Cutt <[email protected]> wrote:
> The type of data being serialized certainly affects how much faster
> marshal is. When testing with just a string approximately 1MB in size
> marshal was around 10 times as fast as JSON but only about 10% faster than
> pickle. With a dict of dicts of integers (around 1MB when serialized with
> marshal) I found that pickle was about 50% faster than JSON but marshal was
> around 200 times faster than pickle!
>
> If I run some more empirical tests I will share some real numbers. What
> sort of data were you using in your tests Andrin?
>
>
> On Friday, June 1, 2012 11:40:38 AM UTC-7, Andrin von Rechenberg wrote:
>>
>> Hey there
>>
>> If you want to store megabytes of JSON in datastore
>> and get it back from datastore into python already parsed,
>> this post is for you.
>>
>> I ran a couple of performance tests in which I store
>> a 4 MB JSON object in the datastore and then get it back at
>> a later point and process it.
>>
>> There are several ways to do this.
>>
>> *Challenge 1) Serialization*
>> You need to serialize your data.
>> For this you can use several different libraries.
>> JSON objects can be serialized using:
>> the json lib, the cPickle lib or the marshal lib.
>> (these are the libraries I'm aware of atm)
>>
>> *Challenge 2) Compression*
>> If your serialized data doesn't fit into 1 MB you need
>> to shard your data over multiple datastore entities and
>> manually piece it together when loading the entities back.
>> If you compress your serialized data before storing it,
>> you pay the cost of compression and decompression,
>> but you have to fetch fewer datastore entities when you
>> want to load your data, and you have to write fewer
>> datastore entities when you update your data if it is
>> sharded.
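
A minimal sketch of the sharding idea on plain byte strings, leaving the datastore out entirely. to_shards, from_shards and the MAX_SHARD_BYTES budget are assumptions for illustration; each chunk would then be stored in its own entity together with its position.

```python
import marshal
import zlib

MAX_SHARD_BYTES = 1000 * 1000  # assumed per-entity budget, a bit under 1 MB


def to_shards(value, shard_size=MAX_SHARD_BYTES, level=1):
    """Serialize, compress, then cut the blob into chunks of at most shard_size bytes."""
    blob = zlib.compress(marshal.dumps(value, 2), level)
    return [blob[i:i + shard_size] for i in range(0, len(blob), shard_size)]


def from_shards(shards):
    """Join the chunks in their original order, then decompress and decode."""
    return marshal.loads(zlib.decompress(b"".join(shards)))


value = {"items": list(range(100000))}
shards = to_shards(value, shard_size=50000)
assert all(len(shard) <= 50000 for shard in shards)
assert from_shards(shards) == value
```

Compressing before cutting is what reduces the number of entities: the smaller the blob, the fewer chunks there are to write and fetch.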
>>
>> *Solution for 1) Serialization:*
>> cPickle is very slow. It's meant to serialize real
>> objects and not just JSON. The json lib is much faster,
>> but compared to marshal it has no chance.
>> *The Python marshal library is definitely the*
>> *way to serialize JSON.* It has the best performance.
>>
>> *Solution for 2) Compression:*
>> For my use case it makes absolute sense to
>> compress the data the marshal lib produces
>> before storing it in datastore. I have gigabytes
>> of JSON data. Compressing the data makes
>> it about 5x smaller. Doing 5x fewer datastore
>> operations definitely pays for the time it
>> takes to compress and decompress the data.
>> There are several compression levels you
>> can use with Python's zlib,
>> from 1 (lowest compression, but fastest)
>> to 9 (highest compression, but slowest).
>> During my tests I found that the optimum
>> is to *compress your serialized data using*
>> *zlib with level 1 compression*. Higher
>> compression takes too much CPU and
>> the result is only marginally smaller.
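
A small sketch of how such a level comparison could be reproduced on your own data. sweep_levels is a hypothetical helper, not the script behind the numbers in this post; zlib.compress(data, level) is the standard call.

```python
import marshal
import time
import zlib


def sweep_levels(value, levels=range(1, 10)):
    """Return (level, compress seconds, compressed size) for each zlib level."""
    raw = marshal.dumps(value, 2)
    rows = []
    for level in levels:
        start = time.time()
        packed = zlib.compress(raw, level)
        elapsed = time.time() - start
        assert zlib.decompress(packed) == raw  # sanity check: lossless round trip
        rows.append((level, elapsed, len(packed)))
    return rows


sample = [["user%05d" % i, "session%03d" % (i % 1000), {"n": i, "tag": "x" * 20}]
          for i in range(20000)]
for level, seconds, size in sweep_levels(sample):
    print("ziplvl %d: compress %.6fs size %d" % (level, seconds, size))
```

Whether level 1 is also the sweet spot for other data is something only a run on that data can tell.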
>>
>> Here are my test results:
>>
>> serializer  ziplvl  dump        load        size
>> cPickle     0       1.671010s   0.764567s   3297275
>> cPickle     1       2.033570s   0.874783s    935327
>> json        0       0.595903s   0.698307s   2321719
>> json        1       0.667103s   0.795470s    458030
>> marshal     0       0.118067s   0.314645s   2311342
>> marshal     1       0.315362s   0.335677s    470956
>> marshal     2       0.318787s   0.380117s    457196
>> marshal     3       0.350247s   0.364908s    446085
>> marshal     4       0.414658s   0.318973s    437764
>> marshal     5       0.448890s   0.350013s    418712
>> marshal     6       0.516882s   0.367595s    409947
>> marshal     7       0.617210s   0.315827s    398354
>> marshal     8       1.117032s   0.346452s    392332
>> marshal     9       1.366547s   0.368925s    391921
>>
>> The results do not include datastore operations;
>> they only cover creating a blob that can be stored
>> in the datastore and getting the parsed data back.
>> The "dump" and "load" times are the seconds it takes
>> to do this on a Google App Engine F1 instance
>> (600 MHz, 128 MB RAM).
>>
>> I posted this email on my blog:
>> http://devblog.miumeet.com/2012/06/storing-json-efficiently-in-python-on.html
>> You can also comment there or on this email thread.
>>
>> Enjoy,
>> -Andrin
>>
>> Here is the library I created and use:
>>
>> #!/usr/bin/env python
>> #
>> # Copyright 2012 MiuMeet AG
>> #
>> # Licensed under the Apache License, Version 2.0 (the "License");
>> # you may not use this file except in compliance with the License.
>> # You may obtain a copy of the License at
>> #
>> #     http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>> #
>>
>> from google.appengine.api import datastore_types
>> from google.appengine.ext import db
>>
>> import zlib
>> import marshal
>>
>> MARSHAL_VERSION = 2
>> COMPRESSION_LEVEL = 1
>>
>> class JsonMarshalZipProperty(db.BlobProperty):
>>   """Stores a JSON serializable object using zlib and marshal in a db.Blob"""
>>
>>   def default_value(self):
>>     return None
>>
>>   def get_value_for_datastore(self, model_instance):
>>     value = self.__get__(model_instance, model_instance.__class__)
>>     if value is None:
>>       return None
>>     return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
>>                                  COMPRESSION_LEVEL))
>>
>>   def make_value_from_datastore(self, value):
>>     if value is not None:
>>       return marshal.loads(zlib.decompress(value))
>>     return value
>>
>>   data_type = datastore_types.Blob
>>
>>   def validate(self, value):
>>     return value
>>
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/E7uCryqsk2QJ.
>
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>