[
https://issues.apache.org/jira/browse/COUCHDB-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13023259#comment-13023259
]
Norman Barker commented on COUCHDB-1120:
----------------------------------------
Checking out and building NIFs is handled by rebar, so rebar would take care of
snappy, but that requires CouchDB to move to a rebar project structure.
For our use case (millions of small docs generated quickly) this is working
well. I like (6): for use cases better suited to archival, picking gzip for
storage would be even better. For us, snappy is great and gives a good
tradeoff.
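To make the speed versus size tradeoff concrete, here is a minimal Python
sketch of how one might measure it. It is only an illustration under stated
assumptions: snappy is not in the Python standard library, so zlib level 1
stands in for the "fast" choice and zlib level 9 for the "archival" choice,
and the document shape produced by make_doc is invented, not our real data.

```python
import json
import time
import zlib


def make_doc(i):
    # Hypothetical small document; the real document shape is not given here.
    doc = {
        "_id": f"doc-{i:08d}",
        "type": "reading",
        "value": i % 97,
        "tags": ["sensor", "sensor", "sensor", "raw"],
    }
    return json.dumps(doc).encode("utf-8")


def measure(level, docs):
    # Compress every document on its own and report total sizes and wall time.
    start = time.perf_counter()
    compressed = [zlib.compress(d, level) for d in docs]
    elapsed = time.perf_counter() - start
    return {
        "level": level,
        "raw_bytes": sum(len(d) for d in docs),
        "packed_bytes": sum(len(c) for c in compressed),
        "seconds": elapsed,
    }


def roundtrip_ok(level, docs):
    # Sanity check: decompressing must give back the original bytes.
    return all(zlib.decompress(zlib.compress(d, level)) == d for d in docs)


if __name__ == "__main__":
    docs = [make_doc(i) for i in range(10000)]
    for level in (1, 9):
        r = measure(level, docs)
        print(r["level"], r["raw_bytes"], r["packed_bytes"], round(r["seconds"], 3))
```

The numbers it prints depend on the machine and on the documents, so they say
nothing about snappy itself; the point is only the shape of the comparison.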
> Snappy compression (databases, view indexes) + keeping doc bodies as ejson
> binaries
> ------------------------------------------------------------------------------------
>
> Key: COUCHDB-1120
> URL: https://issues.apache.org/jira/browse/COUCHDB-1120
> Project: CouchDB
> Issue Type: Improvement
> Components: Database Core
> Environment: trunk
> Reporter: Filipe Manana
> Assignee: Filipe Manana
>
> The branch at:
> https://github.com/fdmanana/couchdb/compare/snappy
> is an experiment which adds snappy compression to database files and view
> index files. Snappy is a very fast compressor/decompressor developed and
> used by Google [1]. Even for small data chunks like 100Kb it can be 2 orders
> of magnitude faster than zlib or Erlang's term_to_binary compression level 1.
> Somewhere at [1] there are benchmark results published by Google that compare
> against zlib's deflate, Erlang's term_to_binary compression, lzo, etc.
> Even small objects, like database headers or btree nodes, still get smaller
> after being compressed with snappy; see the shell session at [2].
> Besides the compression, this branch also keeps the document bodies
> (#doc.body fields) as binaries (snappy compressed ejson binaries) and only
> converts them back to ejson when absolutely needed (done by
> couch_doc:to_json_obj/2, for example). This is similar to COUCHDB-1092, but
> the bodies are compressed EJSON binaries and this approach doesn't suffer
> from the same issue Paul identified before (which could be fixed without
> many changes): on reads we decompress and still do the binary_to_term/1 +
> ?JSON_ENCODE calls as before.
> It also prepares the document summaries before sending the documents to the
> updater, so that we avoid copying EJSON terms and move this task outside of
> the updater to add more parallelism to concurrent updates.
> I ran some tests comparing trunk, before and after the JSON parser NIF was
> added, against this snappy branch.
> I created databases with 1 000 000 documents of 4Kb each. The document
> template is this one: http://friendpaste.com/qdfyId8w1C5vkxROc5Thf
> The databases have this design document:
> {
>   "_id": "_design/test",
>   "language": "javascript",
>   "views": {
>     "simple": {
>       "map": "function(doc) { emit(doc.data5.float1, [doc.strings[2], doc.strings[10]]); }"
>     }
>   }
> }
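For reference, the map function above can be mirrored in Python to see what a
single document contributes to the view. This is only a sketch: map_simple and
sample_doc are hypothetical names, and the sample document is invented to
match the fields the map function reads, not taken from the real template.

```python
def map_simple(doc):
    # Mirrors emit(doc.data5.float1, [doc.strings[2], doc.strings[10]]):
    # one row per document, keyed on a float, with two strings as the value.
    key = doc["data5"]["float1"]
    value = [doc["strings"][2], doc["strings"][10]]
    return [(key, value)]


# Invented document with just enough structure for the map function.
sample_doc = {
    "_id": "example-doc",
    "data5": {"float1": 76.572},
    "strings": [f"s{i}" for i in range(11)],
}

if __name__ == "__main__":
    print(map_simple(sample_doc))  # [(76.572, ['s2', 's10'])]
```

Unlike the JavaScript original, this version raises an exception when a field
or list index is missing, so it is not a drop-in replacement.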
> == Results with trunk ==
> database file size after compaction: 7.5 Gb
> view index file size after compaction: 257 Mb
> ** Before JSON nif:
> $ time curl 'http://localhost:5985/trunk_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real 58m28.599s
> user 0m0.036s
> sys 0m0.056s
> ** After JSON nif:
> fdmanana 12:45:55 /opt/couchdb > time curl 'http://localhost:5985/trunk_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real 51m14.738s
> user 0m0.040s
> sys 0m0.044s
> == Results with the snappy branch ==
> database file size after compaction: 3.2 Gb (vs 7.5 Gb on trunk)
> view index file size after compaction: 100 Mb (vs 257 Mb on trunk)
> ** Before JSON nif:
> $ time curl 'http://localhost:5984/snappy_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real 32m29.854s
> user 0m0.008s
> sys 0m0.052s
> ** After JSON nif:
> fdmanana 15:40:39 /opt/couchdb > time curl 'http://localhost:5984/snappy_db_1m/_design/test/_view/simple?limit=1'
> {"total_rows":1000000,"offset":0,"rows":[
> {"id":"00000632-d25d-49c6-9b4e-e038b78ff97d","key":76.572,"value":["jURcBZ0vrJcmf2roZUMzZJQoTsKZDIdj7KhO7itskKvM80jBU9","fKYYthv8iFvaYoFoYZyB"]}
> ]}
> real 18m39.240s
> user 0m0.012s
> sys 0m0.020s
> A write-only relaximation test also shows a significant improvement in
> write response times and throughput:
> http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63405480d
> These results are also in a file of this branch [3].
> It seems clear that this, together with Paul's JSON NIF parser, has a very
> good impact on the view indexer, in addition to the big disk space savings
> and better write throughput.
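The size and timing figures quoted above can be turned into ratios with a few
lines of Python. This is just arithmetic on the reported numbers; the helper
names (to_seconds, reduction_pct, speedup) are made up for this sketch, and
the curl timings are presumably dominated by the initial view index build.

```python
def to_seconds(minutes, seconds):
    # Convert the "XmY.YYYs" figures printed by `time` into seconds.
    return minutes * 60 + seconds


def reduction_pct(before, after):
    # Percentage by which a file shrank.
    return (1 - after / before) * 100


def speedup(before, after):
    # How many times faster the second run was.
    return before / after


# "real" times copied from the results quoted above.
TRUNK_BEFORE_NIF = to_seconds(58, 28.599)
TRUNK_AFTER_NIF = to_seconds(51, 14.738)
SNAPPY_BEFORE_NIF = to_seconds(32, 29.854)
SNAPPY_AFTER_NIF = to_seconds(18, 39.240)

if __name__ == "__main__":
    print(f"database file: {reduction_pct(7.5, 3.2):.1f}% smaller")
    print(f"view index file: {reduction_pct(257, 100):.1f}% smaller")
    print(f"trunk before NIF -> snappy after NIF: "
          f"{speedup(TRUNK_BEFORE_NIF, SNAPPY_AFTER_NIF):.2f}x")
    print(f"trunk after NIF -> snappy after NIF: "
          f"{speedup(TRUNK_AFTER_NIF, SNAPPY_AFTER_NIF):.2f}x")
```

That works out to a database file about 57% smaller, a view index file about
61% smaller, and a view request roughly 3.1 times faster than trunk without
the JSON NIF.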
> Some potential issues:
> * Snappy is C++, and so is the NIF [4] - however a C++ compiler is common and
> part of most development environments (gcc, xcode, etc)
> * Not sure if snappy builds on Windows - it might build, it doesn't seem to
> depend on fancy libraries, just stdc++ and the STL
> * Requires OTP R13B04 or higher. If built or run on R13B03 or below, it
> simply doesn't do any compression at all, just like current releases.
> However, with 2 servers running this branch, one on R14 and the other on
> R13B01 for example, the second server will not be able to read database
> files created by the server on R14: it will get an exception with the atom
> 'snappy_nif_not_loaded'. This is easy to catch and use for printing a nice,
> explicit error message telling the user that a more recent OTP release is
> needed.
> The upgrade of databases and view indexes from previous releases is done on
> compaction. I only ran a few tests by hand with database files, so this
> surely needs to be better tested.
> Finally, the branch is still in the development phase, though maybe not far
> from completion. Consider this ticket just a way to share some results and
> get some feedback.
> [1] - http://code.google.com/p/snappy/
> [2] - http://friendpaste.com/45AOdi9MkFrS4BPsov7Lg8
> [3] - https://github.com/fdmanana/couchdb/blob/b8f806e41727ba18ed6143cee31a3242e024ab2c/snappy-couch-tests.txt
> [4] - https://github.com/fdmanana/snappy-erlang-nif/