[ 
https://issues.apache.org/jira/browse/COUCHDB-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027182#comment-13027182
 ] 

Norman Barker commented on COUCHDB-1120:
----------------------------------------

I was referring to the trade-off between access speed and file size when 
using snappy vs gzip. It works well.
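
One rough way to see this kind of trade-off on your own data is to time compression and compare the resulting sizes. The sketch below uses Python's standard zlib module (the same DEFLATE algorithm gzip uses) at different levels as a stand-in, since snappy is not in the Python standard library. The sample documents and the measure() helper are made up for illustration and are not the document template from the ticket.

```python
import json
import time
import zlib

# Hypothetical sample data: 1000 small JSON documents (not the ticket's template).
docs = [{"_id": "doc-%06d" % i, "value": i, "tags": ["a", "b", "c"]}
        for i in range(1000)]
raw = json.dumps(docs).encode("utf-8")


def measure(level):
    """Return (compressed size in bytes, compress seconds, decompress seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    mid = time.perf_counter()
    unpacked = zlib.decompress(packed)
    end = time.perf_counter()
    assert unpacked == raw  # the round trip must be lossless
    return len(packed), mid - start, end - mid


for level in (1, 6, 9):
    size, c_time, d_time = measure(level)
    print("level %d: %d -> %d bytes, compress %.4fs, decompress %.4fs"
          % (level, len(raw), size, c_time, d_time))
```

Timings vary from machine to machine, so treat the printed numbers as relative only. A faster compressor such as snappy would be expected to give up some file size in exchange for speed.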

> Snappy compression (databases,  view indexes) + keeping doc bodies as ejson 
> binaries
> ------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-1120
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1120
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>         Environment: trunk
>            Reporter: Filipe Manana
>            Assignee: Filipe Manana
>
> The branch at:
> https://github.com/fdmanana/couchdb/compare/snappy
> Is an experiment which adds snappy compression to database files and view 
> index files. Snappy is a very fast compressor/decompressor developed by and 
> used by Google [1] - even for small data chunks like 100Kb it can be 2 orders 
> of magnitude faster than zlib or Erlang's term_to_binary compression level 1. 
> Somewhere at [1] there are benchmark results published by Google that compare 
> against zlib's deflate, Erlang's term_to_binary compression, lzo, etc.
> Even small objects like database headers or btree nodes still get smaller 
> after compressing them with snappy, see the shell session at [2].
> Besides the compression, this branch also keeps the document bodies 
> (#doc.body fields) as binaries (snappy compressed ejson binaries) and only 
> converts them back to ejson when absolutely needed (done by 
> couch_doc:to_json_obj/2, for example). This is similar to COUCHDB-1092, but 
> the bodies are compressed EJSON binaries, and it doesn't suffer from the same 
> issue Paul identified before (which could be fixed without many changes): on 
> reads we decompress and still do the binary_to_term/1 + ?JSON_ENCODE calls as 
> before.
> It also prepares the document summaries before sending the documents to the 
> updater, so that we avoid copying EJSON terms and move this task outside of 
> the updater to add more parallelism to concurrent updates.
> I made some tests, comparing trunk before and after the JSON parser NIF was 
> added, against this snappy branch.
> I created databases with 1 000 000 documents of 4Kb each. The document 
> template is this one:  http://friendpaste.com/qdfyId8w1C5vkxROc5Thf
> The databases have this design document:
> {
>     "_id": "_design/test",
>     "language": "javascript",
>     "views": {
>         "simple": {
>             "map": "function(doc) { emit(doc.data5.float1, [doc.strings[2], doc.strings[10]]); }"
>         }
>     }
> }
> == Results with trunk ==
> database file size after compaction:  7.5 Gb
> view index file size after compaction:  257 Mb
> ** Before JSON nif:
> $ time curl 
> 'http://localhost:5985/trunk_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real  58m28.599s
> user  0m0.036s
> sys   0m0.056s
> ** After JSON nif:
> fdmanana 12:45:55 /opt/couchdb > time curl 
> 'http://localhost:5985/trunk_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real  51m14.738s
> user  0m0.040s
> sys   0m0.044s
> == Results with the snappy branch ==
> database file size after compaction:  3.2 Gb   (vs 7.5 Gb on trunk)
> view index file size after compaction:  100 Mb  (vs 257 Mb on trunk)
> ** Before JSON nif:
> $ time curl 
> 'http://localhost:5984/snappy_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real  32m29.854s
> user  0m0.008s
> sys   0m0.052s
> ** After JSON nif:
> fdmanana 15:40:39 /opt/couchdb > time curl 
> 'http://localhost:5984/snappy_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real  18m39.240s
> user  0m0.012s
> sys   0m0.020s
> A writes-only relaximation test also shows a significant improvement in the 
> writes response times / throughput:
> http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63405480d
> These results are also in a file of this branch [3].
> It seems clear that this, together with Paul's JSON NIF parser, has a very 
> good impact on the view indexer, besides the big disk space savings and 
> better write throughput.
> Some potential issues:
> * Snappy is C++, and so is the NIF [4] - however a C++ compiler is common and 
> part of most development environments (gcc, xcode, etc)
> * Not sure if snappy builds on Windows - it might build, it doesn't seem to 
> depend on fancy libraries, just stdc++ and the STL
> * Requires OTP R13B04 or higher. If built/running on R13B03 or below, it 
> simply doesn't do any compression at all, just like current releases. 
> However, with 2 servers running this branch, for example one on R14 and the 
> other on R13B01, the second server will not be able to read database files 
> created by the server with R14 - it will get an exception with the atom 
> 'snappy_nif_not_loaded' - this is easy to catch and use for printing a nice 
> and explicit error message telling the user that a more recent OTP release 
> is needed.
> The upgrade of databases and view indexes from previous releases is done on 
> compaction. I made just a few tests with database files by hand, so this 
> surely needs to be better tested.
> Finally, the branch is still in the development phase, but maybe not far from 
> completion. Consider this ticket just as a way to share some results and get 
> some feedback.
> [1] - http://code.google.com/p/snappy/
> [2] - http://friendpaste.com/45AOdi9MkFrS4BPsov7Lg8
> [3] - 
> https://github.com/fdmanana/couchdb/blob/b8f806e41727ba18ed6143cee31a3242e024ab2c/snappy-couch-tests.txt
> [4] - https://github.com/fdmanana/snappy-erlang-nif/
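
For a quick comparison, the figures in the quoted description can be reduced to ratios. The Python sketch below only does the arithmetic: the variable names are mine, and the numbers are copied from the "real" timings and the compacted file sizes reported above.

```python
def to_seconds(minutes, seconds):
    """Convert a `time` reading such as 58m28.599s to seconds."""
    return minutes * 60 + seconds


# "real" times of the first view request, as reported in the description.
timings = {
    "trunk, before JSON NIF": to_seconds(58, 28.599),
    "trunk, after JSON NIF": to_seconds(51, 14.738),
    "snappy, before JSON NIF": to_seconds(32, 29.854),
    "snappy, after JSON NIF": to_seconds(18, 39.240),
}

# Speedup of each run relative to trunk without the JSON NIF.
baseline = timings["trunk, before JSON NIF"]
speedups = {name: baseline / t for name, t in timings.items()}

# Fraction of disk space saved after compaction (snappy branch vs trunk).
db_saving = 1 - 3.2 / 7.5      # database file, sizes as reported (Gb)
view_saving = 1 - 100 / 257    # view index file, sizes as reported (Mb)

for name, ratio in speedups.items():
    print("%-24s %.2fx" % (name, ratio))
print("database file saving:   %.0f%%" % (db_saving * 100))
print("view index file saving: %.0f%%" % (view_saving * 100))
```

By these numbers the first view request on the snappy branch with the JSON NIF completes roughly 3.1 times faster than on trunk without the NIF, and the compacted files are about 57% (database) and 61% (view index) smaller.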

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
