[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms

Filipe Manana (JIRA) Wed, 16 Mar 2011 14:48:58 -0700

    [ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007702#comment-13007702 ]


Filipe Manana commented on COUCHDB-1092:
----------------------------------------

I ran another small experiment that keeps the serialized EJSON (the 
term_to_binary output) in the #doc.body field, to avoid copying memory when 
passing documents between processes.

Here are some results:


term_to_binary options [{compressed, 9}, {minor_version, 1}]

100 000 11Kb docs with many floats

database size 311 Mb

$ time curl http://localhost:5984/testdb_ejson_bins_9/_design/test/_view/simple?limit=1
{"total_rows":100000,"offset":0,"rows":[
{"id":"00001ef7-ab55-4d07-93da-7e368fb03ef9","key":null,"value":"2fQUbzRUax4A"}
]}

real    13m43.615s
user    0m0.012s
sys     0m0.020s

term_to_binary options [{compressed, 1}, {minor_version, 1}]

100 000 11Kb docs with many floats

database size 315 Mb


$ time curl http://localhost:5984/testdb_ejson_bins_1/_design/test/_view/simple?limit=1
{"total_rows":100000,"offset":0,"rows":[
{"id":"00001ef7-ab55-4d07-93da-7e368fb03ef9","key":null,"value":"2fQUbzRUax4A"}
]}

real    13m1.544s
user    0m0.012s
sys     0m0.020s


With the branch that keeps the raw JSON binaries in the #doc records, it takes 
about 4 minutes to generate the same view from scratch, and the database size 
is 297 Mb (a very small difference compared to the 311 Mb and 315 Mb above).
Trunk takes about 18 minutes for the view generation and its database file 
size is 2202 Mb.
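
As a rough way to see how much the compression level matters for float-heavy
data, here is a minimal Python sketch. It is not the Erlang code from the
experiment: it compresses JSON text with zlib (the library behind the
`compressed` option of term_to_binary) rather than the Erlang external term
format, so absolute numbers will differ. The document shape and the helper
names `make_doc` and `compressed_sizes` are assumptions for illustration only.

```python
import json
import random
import zlib


def make_doc(seed: int, n_floats: int = 400) -> dict:
    """Build a synthetic document dominated by floats (hypothetical shape)."""
    rng = random.Random(seed)
    return {"floats": [rng.random() for _ in range(n_floats)]}


def compressed_sizes(doc: dict, levels=(1, 9)) -> dict:
    """Return the raw JSON size and the zlib-compressed size for each level."""
    raw = json.dumps(doc).encode("utf-8")
    sizes = {"raw": len(raw)}
    for level in levels:
        packed = zlib.compress(raw, level)
        # Sanity check: compression must be lossless.
        assert zlib.decompress(packed) == raw
        sizes[f"level_{level}"] = len(packed)
    return sizes


if __name__ == "__main__":
    print(compressed_sizes(make_doc(seed=42)))
```

Comparing the `level_1` and `level_9` entries for documents shaped like the
real test data gives a quick idea of whether the extra CPU spent at level 9
buys any meaningful space.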

I also ran another experiment that removes all the string escaping in the JSON 
encoder (all my test data only uses 7-bit ASCII characters), to see if this 
was the bottleneck, since the decoder validates each character in a 
string/binary.
The patch can be found at:  http://friendpaste.com/7DRPbUVVLH3AWxKIsUOx84

View generation time didn't improve much, however:

$ curl -X PUT http://localhost:5984/testdb_ejson_bins_1/_design/test -d @json_test_ddoc.json
{"ok":true,"id":"_design/test","rev":"1-02ac87c42b2e623f7adeaded381d2c2a"}
fdmanana 20:36:50 ~/git/hub/couchdb (ejon_bins)> time curl http://localhost:5984/testdb1/_design/test/_view/simple?limit=1
{"total_rows":100000,"offset":0,"rows":[
{"id":"00003297-7e7a-4028-985f-e5ae3c784e3a","key":null,"value":"2fQUbzRUax4A"}
]}

real    12m48.946s
user    0m0.000s
sys     0m0.024s
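
To compare the three runs above more directly, the `real` figures can be
converted to seconds and to documents indexed per second. This is a small
Python sketch; the helper names `parse_real` and `docs_per_second` are
hypothetical, and the run labels are just shorthand for the experiments above.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time` figure such as '13m43.615s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def docs_per_second(doc_count: int, real: str) -> float:
    """Average indexing throughput for a view build that took `real`."""
    return doc_count / parse_real(real)


if __name__ == "__main__":
    # `real` values copied from the three 100 000 document runs above.
    runs = {
        "compressed level 9": "13m43.615s",
        "compressed level 1": "13m1.544s",
        "no string escaping": "12m48.946s",
    }
    for label, real in runs.items():
        print(f"{label}: {docs_per_second(100_000, real):.1f} docs/s")
```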


> Storing documents bodies as raw JSON binaries instead of serialized JSON terms
> ------------------------------------------------------------------------------
>
>                 Key: COUCHDB-1092
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1092
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>            Reporter: Filipe Manana
>            Assignee: Filipe Manana
>
> Currently we store documents as Erlang serialized (via the term_to_binary/1 BIF) EJSON.
> The proposed patch changes the database file format so that instead of storing serialized
> EJSON document bodies, it stores raw JSON binaries.
> The GitHub branch is at: https://github.com/fdmanana/couchdb/tree/raw_json_docs
> Advantages:
> * what we write to disk is much smaller - a raw JSON binary can easily get up to 50% smaller
>   (at least according to the tests I did)
> * when serving documents to a client we no longer need to JSON encode the document body
>   read from the disk - this applies to individual document requests, view queries with
>   ?include_docs=true, pull and push replications, and possibly other use cases.
>   We just grab its body and prepend the _id, _rev and all the necessary metadata fields
>   (this is via simple Erlang binary operations)
> * we avoid the EJSON term copying between request handlers and the db updater processes,
>   between the work queues and the view updater process, between replicator processes, etc
> * before sending a document to the JavaScript view server, we no longer need to convert it
>   from EJSON to JSON
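
The read path described in the second advantage (prepending _id and _rev to
the stored body without decoding it) can be sketched as follows. This is a
Python illustration, not the branch's Erlang code; the function name
`with_metadata` is hypothetical, and it assumes the stored body is a compact
JSON object with no surrounding whitespace.

```python
import json


def with_metadata(body: bytes, doc_id: str, rev: str) -> bytes:
    """Prepend _id and _rev to a stored JSON object without parsing it."""
    if not (body.startswith(b"{") and body.endswith(b"}")):
        raise ValueError("stored body must be a JSON object")
    # json.dumps takes care of escaping the id and revision strings.
    head = (
        b'{"_id":' + json.dumps(doc_id).encode("utf-8")
        + b',"_rev":' + json.dumps(rev).encode("utf-8")
    )
    rest = body[1:]  # everything after the opening brace
    # An empty stored object needs no separating comma.
    separator = b"" if rest == b"}" else b","
    return head + separator + rest


if __name__ == "__main__":
    print(with_metadata(b'{"float1":0.5}', "doc1", "1-abc"))
```

The stored bytes are copied as-is; only the two metadata strings are encoded
per request.
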
> The changes done to the document write workflow are minimalist - after JSON decoding the
> document's JSON into EJSON and removing the metadata top level fields (_id, _rev, etc), it
> JSON encodes the resulting EJSON body into a binary - this consumes CPU of course but it
> brings 2 advantages:
> 1) we avoid the EJSON copy between the request process and the database updater process -
>    for any realistic document size (4kb or more) this can be very expensive, especially
>    when there are many nested structures (lists inside objects inside lists, etc)
> 2) before writing anything to the file, we do a term_to_binary([Len, Md5, TheThingToWrite])
>    and then write the result to the file. A term_to_binary call with a binary as the input
>    is very fast compared to a term_to_binary call with EJSON as input (or some other nested
>    structure)
> I think both compensate for the JSON encoding after the separation of meta data fields
> and non-meta data fields.
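
The write workflow described above (decode, strip the top-level metadata
fields, re-encode the remaining body as a binary) can be sketched like this.
It is a Python illustration rather than the branch's Erlang code; the name
`split_for_storage` is hypothetical, and treating every top-level field that
starts with an underscore as metadata is an assumption of this sketch.

```python
import json
from typing import Tuple


def split_for_storage(request_json: bytes) -> Tuple[dict, bytes]:
    """Decode a document, separate its metadata, and re-encode the body."""
    doc = json.loads(request_json)
    if not isinstance(doc, dict):
        raise ValueError("a document must be a JSON object")
    # Assumption: top-level names starting with "_" are metadata fields.
    meta = {k: v for k, v in doc.items() if k.startswith("_")}
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    # Compact separators keep the stored binary as small as possible.
    raw_body = json.dumps(body, separators=(",", ":")).encode("utf-8")
    return meta, raw_body


if __name__ == "__main__":
    meta, raw = split_for_storage(b'{"_id": "doc1", "float1": 0.5}')
    print(meta, raw)
```

Only `raw_body`, a flat byte string, would then need to be handed to the
process that writes to the file.
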
> The following relaximation graph, for documents with sizes of 4Kb, shows a significant
> performance increase both for writes and reads - especially reads.
> http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400b94f
> I've also made a few tests to see how much the improvement is when querying a view, for the
> first time, without ?stale=ok. The size difference of the databases (after compaction) is
> also very significant - this change can reduce the size by at least 50% in common cases.
> The test databases were created in an instance built from that experimental branch.
> Then they were replicated into a CouchDB instance built from the current trunk.
> At the end both databases were compacted (to fairly compare their final sizes).
> The databases contain the following view:
> {
>     "_id": "_design/test",
>     "language": "javascript",
>     "views": {
>         "simple": {
>             "map": "function(doc) { emit(doc.float1, doc.strings[1]); }"
>         }
>     }
> }
> ## Database with 500 000 docs of 2.5Kb each
> Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_2_5k.json
> Sizes (branch vs trunk):
> $ du -m couchdb/tmp/lib/disk_json_test.couch 
> 1996  couchdb/tmp/lib/disk_json_test.couch
> $ du -m couchdb-trunk/tmp/lib/disk_ejson_test.couch 
> 2693  couchdb-trunk/tmp/lib/disk_ejson_test.couch
> Time, from a user's perspective, to build the view index from scratch:
> $ time curl http://localhost:5984/disk_json_test/_design/test/_view/simple?limit=1
> {"total_rows":500000,"offset":0,"rows":[
> {"id":"0000076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
> ]}
> real  6m6.740s
> user  0m0.016s
> sys   0m0.008s
> $ time curl http://localhost:5985/disk_ejson_test/_design/test/_view/simple?limit=1
> {"total_rows":500000,"offset":0,"rows":[
> {"id":"0000076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
> ]}
> real  15m41.439s
> user  0m0.012s
> sys   0m0.012s
> ## Database with 100 000 docs of 11Kb each
> Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_11k.json
> Sizes (branch vs trunk):
> $ du -m couchdb/tmp/lib/disk_json_test_11kb.couch
> 1185  couchdb/tmp/lib/disk_json_test_11kb.couch
> $ du -m couchdb-trunk/tmp/lib/disk_ejson_test_11kb.couch
> 2202  couchdb-trunk/tmp/lib/disk_ejson_test_11kb.couch
> Time, from a user's perspective, to build the view index from scratch:
> $ time curl http://localhost:5984/disk_json_test_11kb/_design/test/_view/simple?limit=1
> {"total_rows":100000,"offset":0,"rows":[
> {"id":"00001511-831c-41ff-9753-02861bff73b3","key":null,"value":"2fQUbzRUax4A"}
> ]}
> real  4m19.306s
> user  0m0.008s
> sys   0m0.004s
> $ time curl http://localhost:5985/disk_ejson_test_11kb/_design/test/_view/simple?limit=1
> {"total_rows":100000,"offset":0,"rows":[
> {"id":"00001511-831c-41ff-9753-02861bff73b3","key":null,"value":"2fQUbzRUax4A"}
> ]}
> real  18m46.051s
> user  0m0.008s
> sys   0m0.016s
> All in all, I haven't yet seen any disadvantage with this approach. Also, the code changes
> don't bring additional complexity. I say the performance and disk space gains it gives are
> very positive.
> This branch still needs to be polished in a few places. But I think it isn't far from
> getting mature.
> Other experiments that can be done are to store view values as raw JSON binaries as well
> (instead of EJSON) and optional compression of the stored JSON binaries (since it's pure
> text, the compression ratio is very high).
> However, I would prefer to do these other 2 suggestions in separate branches/patches - I
> haven't actually tested any of them yet, so maybe they won't bring significant gains.
> Thoughts? :)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
