Hi All,

I propose to merge the following work, https://github.com/rnewson/couchdb/tree/nebraska-merge-candidate, into the official Apache CouchDB repository on a new branch (i.e., *not* master). Once there, the full CouchDB developer community can begin the work of incorporating the code into an official release.
You do not need to respond if you are in agreement. If there is no response within 72 hours, I will assume lazy consensus. If we reach consensus, I will start the IP clearance process and then the merge.

As most of you know, Paul Davis and I recently sequestered ourselves away from society (in a place called Nebraska) to make this merge happen. I want to clarify that this work is not the BigCouch code you can see on github.com/cloudant/bigcouch but the Cloudant platform from which BigCouch was made. This means it is bang up to date with all the bug fixes and feature enhancements we've made in the last eighteen months or more.

With that clarification made, here are our notes about what we achieved, what it means to the project, and what isn't yet done:

Nebraska Merge Roundup

Stats:

1402   - total new commits
312    - commits written during the merge (will be reduced substantially by squashing)
408    - number of files changed
21,897 - number of lines added
4,277  - number of lines removed

A retrospective:

Bob Newson and I have come to the end of our sprint to merge BigCouch into Apache CouchDB. It's been a productive ten days here in the Midwest. I managed to get Bob out to a bowling alley and he managed to get me to a sushi restaurant. In between the cultural exchanges we also managed to get a significant amount of work done on the merge.

The current status of the merge is that we've resolved the differences in the single node execution of CouchDB. Both the JavaScript and Erlang test suites run, with only one failure in the Erlang test suite due to a (deliberately) missing constraint on the number of operating system processes. This should be a relatively straightforward fix, but it was not prioritized during our limited time, which we spent on the larger issues.

We merged a large number of performance and stability enhancements back into single node CouchDB, as well as a number of pure bug fixes. The biggest highlight is a brand new compactor that is both faster and produces smaller, better organized post-compaction databases.

Single node operation should be completely unaffected, as demonstrated by the passing test suites. On the other hand, we haven't yet finished updating the clustered code to use some of the new changes in single node CouchDB. The single most significant portion of this work involves updating the internal cluster API for views to use the recently rewritten indexer APIs. This should be a relatively straightforward bit of work that we'll be finishing over the next few weeks.

All in all, the merge work done so far has been quite successful. We've met our primary goal of getting the code merged in a fashion that does not affect single node operation while providing a starting point for the larger community to begin reviewing the more significant changes. Given the size of the diff between the two code bases, we never expected to have a fully working clustered solution after ten days of work, but we have succeeded in providing a base that will allow us and new contributors to get up to speed quickly. This work, coupled with work by Dave Cottlehuber and Benoît Chesneau on updating the build system and various other internal updates, will provide a solid foundation going forward. It's an exciting time for CouchDB, and anyone interested should keep an eye on the next few releases as we ramp up work on various core aspects of the database.
We've had a busy few days preparing the road for the next twelve to eighteen months of Apache CouchDB, and we hope everyone will feel as excited about them as we do. It should be an exciting ride.

Things we got done

* Large update to the source tree layout for Erlang applications. Each application now has a src/appname/(c_src|ebin|priv|src) structure. The build system has been updated accordingly.

* Renamed src/couchdb to src/couch to match the Erlang convention of the top directory name matching the Erlang application name.

* Imported the Cloudant Erlang applications for clustered CouchDB. These are imported with their history by using git subtree and merging the top-level commit. These are not external deps; development will happen within the CouchDB tree. The imported apps are:

  * config - A couch_config replacement (behavior is mostly identical to couch_config, except in how we listen for configuration changes internally, to allow for smooth hot code upgrades).
  * twig - A replacement for couch_log that emits rsyslog records.
  * rexi - An RPC library. Replaces Erlang's built-in rex application to avoid costly safety measures in the interest of performance and throughput.
  * mem3 - The "Dynamo" part of BigCouch, responsible for managing cluster state.
  * fabric - The internal cluster-aware CouchDB API.
  * ets_lru - A small library application that provides an LRU implementation using a couple of ets tables (a rough sketch of the two-table idea is appended at the end of this mail).
  * ddoc_cache - Caches design documents on each node for use in design handler functions. This uses an ets_lru cache with a very short TTL.
  * chttpd - The cluster-aware HTTP layer.

  Each imported app also had its build system updated to use Autotools, along with the application layout changes noted above for the existing CouchDB Erlang apps.

* Merged a large number of updates and fixes to couch_replicator based on work done internally at Cloudant. Unfortunately, due to an error when we created our internal clone, we lost a bit of history in the initial merge and have one big commit that mostly affects couch_replicator_manager. There are a number of other commits related to couch_replicator that resolve the single node vs. clustered differences. Some notable couch_replicator features:

  * Optionally disable checkpoints so that replication can work when a source is read-only. This should only be used for smaller databases, as each replication call then has to scan the entire source database on each invocation.
  * A new changes_pending field in the _active_tasks output.
  * A fix to continuous replication to automatically reconnect to a continuous changes feed when it sees a last_seq value. This allows the source to selectively recycle the HTTP connections used, which can be quite useful for "permanent" replications.
  * A multitude of smaller bug fixes and stability enhancements.

Updates to single node couch:

* We changed the by_seq tree to store a copy of the #full_doc_info{} record instead of the #doc_info{} record. This gives significant speed improvements for compaction, replication, and generally anything that needs to walk the by_seq tree and access document bodies internally.

* We rewrote the compactor to be significantly faster and to produce significantly better compacted databases. The two main changes are to use a temp file and to avoid btrees within that temp file. The temp file only contains a temporary copy of the document ids. At the end of a compaction run we then rebuild the by_id btree in the compaction file from this temp file.
  The reason this helps so much is that compaction walks the update_seq btree, which in most cases means the id tree would be updated in roughly random order, and that is very bad for our append-only btrees. By using the temp file we can stream the ids back into the compacted db file in order at the end of compaction, generating a minimal amount of garbage in the process. The other upgrade was to implement an external merge sort module (couch_emsort) that is used with this temporary file. (A rough sketch of this gather/sort/stream idea is appended at the end of this mail.)

* Reject updates to design docs that introduce source code that breaks compilation. Currently we only check map and reduce functions, as the others should produce user-visible errors instead of inexplicably empty views. (This one happened mostly because my OCD kicked in and I was unable to resist.)

* Reverted a change made a long time ago that uses two file descriptors for each database. See the todo list.

* The reason to remove the second fd is so that we can rewrite ref counting. Better ref counting makes everyone happy, but the real reason is the next bullet point:

* Optimized couch_server to not require a round-trip message pass when opening a database that's in the LRU. This is a significant performance boost for highly concurrent access. We also optimized couch_server internals to not blow up under load.

* Introduced a #leaf{} record into the revision trees. This is never written to disk but makes internal code a lot cleaner when dealing with multiple versions of rev tree values.

* Some changes to couch_changes to enable clustered access, plus some general cleanup.

* Internal changes to how CouchDB is booted in Erlang land. Not very sexy, but this removes a lot of complicated un-Erlangy bits. We still have a bit of work left here.

* Btree chunk sizes are now configurable, which lets people adjust the RAM/speed tradeoff a bit more.

* We now load update validation functions on the first write. This is a cluster-motivated change, because the clustered version of this call is expensive and can lead to race conditions when opening a bunch of db shards simultaneously. It should be invisible to external clients.

* Disabled conflict detection for local docs. They don't replicate, so there's no point; it just led to clusters getting stuck and confused when there were lots of replications happening.

* Changes to the multipart/mime parsing code. Necessary for clustered attachment uploads to split the incoming data stream into N copies.

* Don't use init:restart/0 when reloading the ICU driver. I think this has a bug, but we should rewrite this driver as a NIF anyway.

* New couch OS process manager. Significantly faster access to OS processes under heavy load. This replaces the hard limit with a soft limit: processes spawned over the soft limit will be used until they've sat idle for a few minutes and then be closed. We have a todo item to add the hard ceiling back in (while keeping the soft ceiling).

* Automatically replace some easily identifiable JS reduce functions with their builtin counterparts. This uses a regex to do the detection, so it's not too smart (a sketch of the approach is appended at the end of this mail).

* Improved view updater write batching.

* Updates to couchjs's views.js to improve index update speeds.

* Updates to the _stats builtin reduce to allow reduces to work over emitted stats objects. Sometimes clients have summary data in a doc, and this allows them to combine stats as long as they follow the same pattern the builtin expects.

* Added a config:reload() that is accessible by POSTing to _config/_reload. It is used by the JS tests to reset the config to what's on disk.
  This should prevent those test run failures where one test fails, leaves the config in a bad state, and causes all subsequent tests to fail. I think. Maybe.

* Databases are deleted synchronously in the test suite. We may need to address this on Windows, but it does seem to reduce the number of "{error, file_exists}" failures.

* I reimplemented the JS restartServer() function. There's a new _restart/token URL that gives a unique value for each instance of the Erlang VM. To run a restart we grab the current token value, hit _restart, then wait until we get a successful response with a different token. This appears to have made the restart strategy more robust. (A sketch of this wait loop is appended at the end of this mail.)

Things that need doing

* IP Clearance - We'll need to track down whether we have the CCLA, as well as look at each source file added to make sure each one is strictly from Cloudant or has an amenable license. I'm pretty sure that the only one of interest is trunc_io.erl, but we need to be thorough.

* Documentation - There shouldn't be much here, since the entire point of this merge was to not change the visible behavior of single node couch. A few things to add about the testing endpoints, and maybe an update to the compaction section to mention the two new file names used.

* Copyright notices - We need to strip copyright notices out of individual files and make sure all files have a standard Apache License v2 header.

* Clustered vhosts - We've never implemented this at Cloudant. We either need to write a clustered version or go back and tell people to use HAProxy (or similar) for such things.

* twig - We need to add another output type to twig that is configurable in some manner. Right now we spit out entire rsyslog records, which isn't useful for most people. We'll need to implement the file writer from couch_log as well as update the _log HTTP handler to know when it can and can't expect to find data on disk.

* fabric - This is going to need a lot of work. Specifically, view access needs to be updated to work with couch_mrview and friends.

* Boot a dev cluster - Once we fix up the clustering code we'll need to write instructions and scripts for bringing up a dev cluster.

* OTP stuff - We've updated each app, but we still need to pull some parts out of couchdb into their own applications. Specifically, the HTTP layer needs its own app. We could probably pull out the OS process/query_servers code as well as the OS daemons and friends. Once that's done we need to update the supervision trees so we don't have things like couch starting and managing the replication manager process.

* ddoc_cache - Wire this up in couch_httpd_db so that it's actually used. Right now it's only used in chttpd.

* couch_file upgrade - The revert that removed the second updater_fd from each #db{} record means that we're back in the original position of files appearing to slow down significantly under load. Since the initial hammer approach of just adding a second fd, we've discovered that the underlying bug is due to the way message passing interacts with Erlang's file I/O. Significantly, though, the fix is rather simple to implement. A first draft of this work is on an old branch of mine here: https://github.com/davisp/couchdb/commit/d856878

* Finish the size calculation changes - The #leaf{} record change is there to enable us to add more data size calculations. CouchDB master calculates a data size that accounts for all bytes that are active in a .couch file. Cloudant is interested in the total size of uncompressed docs and attachments minus the internal overhead of btrees.
  And there's a fourth number to calculate based on the compression level used. Having each of these numbers will be useful, as will the calculations they enable (i.e., dead bytes in the file, bytes used for overhead, compression ratio achieved, etc.).

* couch_proc_manager - We need to implement the hard ceiling for capping the number of OS processes. We've started seeing a need for this at Cloudant with some workloads, so motivation to fix it is high. The only failing etap test is the assertion of this ceiling.

* Synchronous db delete on Windows - I did this because running the test suite was driving me bonkers. I need to ask Dave how this behaves on Windows (my guess is not well), but I think we can close things up so that it works better than the status quo.
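Appendix: illustrative sketches

A few items above point at sketches appended here. To be clear, none of this is the actual CouchDB or Cloudant code: the module names, function names, patterns, and constants below are invented, and everything is heavily simplified. The sketches are only meant to make the prose descriptions above concrete.

First, the compactor. The bullet above describes gathering document ids into a temp file while walking the by_seq tree, sorting them externally (couch_emsort), and then streaming them back in id order so the rebuilt by_id btree is written sequentially with minimal garbage. The toy module below shows that gather/sort/stream shape with a fixed run size and plain sorted lists; it is not couch_emsort and contains no btree code.

    %% Toy external sort in the spirit of the compactor change: spill
    %% {Id, Pointer} pairs (seen in update_seq order) into sorted runs on
    %% disk, then merge the runs so the ids come back in id order.
    -module(emsort_sketch).
    -export([sort_to_disk/2, merge_runs/1]).

    -define(RUN_SIZE, 1000).

    %% Split the pairs into runs, sort each run, and write it to its own file.
    sort_to_disk(Pairs, Dir) ->
        Runs = chunk(Pairs, ?RUN_SIZE, []),
        lists:map(
            fun({N, Run}) ->
                Path = filename:join(Dir, "run." ++ integer_to_list(N)),
                ok = file:write_file(Path, term_to_binary(lists:sort(Run))),
                Path
            end,
            lists:zip(lists:seq(1, length(Runs)), Runs)).

    %% Merge the sorted runs back into a single id-ordered list. A real
    %% external merge sort streams the runs instead of loading them whole.
    merge_runs(Paths) ->
        SortedRuns = [begin
                          {ok, Bin} = file:read_file(Path),
                          binary_to_term(Bin)
                      end || Path <- Paths],
        lists:merge(SortedRuns).

    chunk([], _N, Acc) ->
        lists:reverse(Acc);
    chunk(Pairs, N, Acc) ->
        {Run, Rest} = case length(Pairs) > N of
            true  -> lists:split(N, Pairs);
            false -> {Pairs, []}
        end,
        chunk(Rest, N, [Run | Acc]).

The point is the access pattern: the merged output can be folded in id order and appended to the new by_id btree in one pass, instead of inserting ids in effectively random order.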
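Next, ets_lru. The description above is "an LRU implementation using a couple of ets tables". A rough sketch of how the two-table idea can work: a set table maps keys to values plus an access stamp, and an ordered_set table maps stamps back to keys, so the least recently used entry is always ets:first/1 of the second table. This is not the real ets_lru API (which is a gen_server and also supports the TTLs that ddoc_cache relies on); the names and record below are invented.

    %% Minimal two-table LRU sketch. Not the ets_lru API.
    -module(lru_sketch).
    -export([new/1, put/3, get/2]).

    -record(lru, {objects, atimes, max_size}).

    new(MaxSize) ->
        #lru{objects  = ets:new(objects, [set]),
             atimes   = ets:new(atimes, [ordered_set]),
             max_size = MaxSize}.

    put(#lru{objects = Objs, atimes = ATimes, max_size = Max} = Lru, Key, Val) ->
        remove(Lru, Key),
        %% unique_integer/1 (OTP 18+) is just a monotonically increasing stamp.
        Stamp = erlang:unique_integer([monotonic]),
        ets:insert(Objs, {Key, Val, Stamp}),
        ets:insert(ATimes, {Stamp, Key}),
        case ets:info(Objs, size) > Max of
            true ->
                %% Evict the least recently used entry: the smallest stamp.
                OldestStamp = ets:first(ATimes),
                [{OldestStamp, OldKey}] = ets:lookup(ATimes, OldestStamp),
                remove(Lru, OldKey);
            false ->
                ok
        end,
        Lru.

    get(#lru{objects = Objs, atimes = ATimes}, Key) ->
        case ets:lookup(Objs, Key) of
            [{Key, Val, OldStamp}] ->
                %% Touch the entry so it becomes the most recently used.
                NewStamp = erlang:unique_integer([monotonic]),
                ets:delete(ATimes, OldStamp),
                ets:insert(ATimes, {NewStamp, Key}),
                ets:update_element(Objs, Key, {3, NewStamp}),
                {ok, Val};
            [] ->
                not_found
        end.

    remove(#lru{objects = Objs, atimes = ATimes}, Key) ->
        case ets:lookup(Objs, Key) of
            [{Key, _Val, Stamp}] ->
                ets:delete(ATimes, Stamp),
                ets:delete(Objs, Key);
            [] ->
                ok
        end.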
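Third, the builtin reduce detection. The item above says we automatically swap easily identifiable JS reduce functions for their builtin counterparts using a regex. The patterns below are made up for illustration (the real detection code and patterns in the tree differ), but the shape is the same: try a short list of regexes against the reduce source and fall back to running the user's function.

    %% Illustrative regex-based detection of trivial JS reduce functions.
    -module(reduce_rewrite_sketch).
    -export([maybe_builtin/1]).

    %% Returns {builtin, Name} when the JS source is a trivial wrapper
    %% around sum(values) or values.length, otherwise {custom, Source}.
    maybe_builtin(Source) ->
        Patterns = [
            {<<"_sum">>,
             "^\\s*function\\s*\\([^)]*\\)\\s*\\{\\s*return\\s+sum\\(values\\);?\\s*\\}\\s*$"},
            {<<"_count">>,
             "^\\s*function\\s*\\([^)]*\\)\\s*\\{\\s*return\\s+values\\.length;?\\s*\\}\\s*$"}
        ],
        check(Source, Patterns).

    check(Source, []) ->
        {custom, Source};
    check(Source, [{Builtin, Pattern} | Rest]) ->
        case re:run(Source, Pattern, [{capture, none}]) of
            match   -> {builtin, Builtin};
            nomatch -> check(Source, Rest)
        end.

For example, maybe_builtin(<<"function(keys, values) { return sum(values); }">>) comes back as {builtin, <<"_sum">>}, while anything the patterns don't recognize is left alone, which matches the "not too smart" caveat above.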
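Finally, the restartServer() change. The real helper lives in the JS test suite; the sketch below re-expresses the same wait loop in Erlang using httpc, with the endpoint paths taken from the description above (GET _restart/token, POST _restart) and the rest invented: read the current token, trigger the restart, then poll until a request succeeds and returns a different token.

    %% Sketch of the restart-and-wait protocol described above.
    -module(restart_wait_sketch).
    -export([restart_and_wait/1]).

    restart_and_wait(BaseUrl) ->
        {ok, _} = application:ensure_all_started(inets),
        {ok, OldToken} = get_token(BaseUrl),
        %% Trigger the restart; this request may fail mid-flight, which is fine.
        _ = (catch httpc:request(post, {BaseUrl ++ "/_restart", [],
                                        "application/json", ""}, [], [])),
        wait_for_new_token(BaseUrl, OldToken, 100).

    wait_for_new_token(_BaseUrl, _OldToken, 0) ->
        {error, timeout};
    wait_for_new_token(BaseUrl, OldToken, Retries) ->
        case get_token(BaseUrl) of
            {ok, Token} when Token =/= OldToken ->
                {ok, Token};
            _ ->
                timer:sleep(200),
                wait_for_new_token(BaseUrl, OldToken, Retries - 1)
        end.

    get_token(BaseUrl) ->
        case httpc:request(get, {BaseUrl ++ "/_restart/token", []}, [], []) of
            {ok, {{_, 200, _}, _Headers, Body}} -> {ok, Body};
            _ -> error
        end.

The important property, as noted above, is that we wait for a *different* token rather than just any successful response, so we can't declare success against the old VM instance before it has actually gone away.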