Hi All,

I propose to merge the following work, https://github.com/rnewson/couchdb/tree/nebraska-merge-candidate, into the official Apache CouchDB repository on a new branch (i.e., *not* master). Once there, the full CouchDB developer community can begin the work of incorporating the code into an official release.
You do not need to respond if you are in agreement. If there is no response within 72 hours, I will assume lazy consensus. If we reach consensus, I will start the IP clearance process and then the merge.

As most of you know, Paul Davis and I recently sequestered ourselves away from society (in a place called Nebraska) to make this merge happen. I want to clarify that this work is not the BigCouch code you can see on github.com/cloudant/bigcouch but the Cloudant platform from which BigCouch was made. This means it is bang up to date with all the bug fixes and feature enhancements we've made in the last eighteen months or more.

With that clarification made, here are our notes about what we achieved, what it means to the project, and what isn't yet done:

Nebraska Merge Roundup

Stats:

1402   - total new commits
312    - commits written during the merge (will be reduced substantially by squashing)
408    - number of files changed
21,897 - number of lines added
4,277  - number of lines removed

A retrospective:

Bob Newson and I have come to the end of our sprint to merge BigCouch into Apache CouchDB. It's been a productive ten days here in the Midwest. I managed to get Bob out to a bowling alley and he managed to get me to a sushi restaurant. In between the cultural exchanges we also managed to get a significant amount of work done on the merge.

The current status of the merge is that we've resolved the differences in the single node execution of CouchDB. Both the JavaScript and Erlang test suites run, with only one failure in the Erlang test suite due to a (deliberately) missing constraint on the number of operating system processes. This should be a relatively straightforward fix, but it was not prioritized during our limited time, which we spent on the larger issues.

We merged a large number of performance and stability enhancements back into single node CouchDB, as well as a number of pure bug fixes. The biggest highlight is a brand new compactor that is both faster and produces smaller, better organized post-compaction databases.

Single node operation should be completely unaffected, as demonstrated by the passing test suites. On the other hand, we haven't yet finished updating the clustered code to use some of the new changes in single node CouchDB. The single most significant portion of this work involves updating the internal cluster API for views to use the recently rewritten indexer APIs. This should be a relatively straightforward bit of work that we'll be finishing over the next few weeks.

All in all, the merge work done so far has been quite successful. We've met our primary goal of getting the code merged in a fashion that does not affect single node operation while providing a starting point for the larger community to begin reviewing the more significant changes. Given the size of the diff between the two code bases, we never expected to have a fully working clustered solution after ten days of work, but we have succeeded in providing a base that will allow us and new contributors to get up to speed quickly. This work, coupled with work by Dave Cottlehuber and Benoît Chesneau on updating the build system and various other internal updates, will provide a solid foundation going forward. It's an exciting time for CouchDB, and anyone interested should keep an eye on the next few releases as we ramp up work on various core aspects of the database.
We've had a busy few days preparing the road for the next twelve to eighteen months of Apache CouchDB, and we hope everyone will feel as excited about them as we do. It should be an exciting ride.

Things we got done

* Large update to the source tree layout for Erlang applications. Each application now has a src/appname/(c_src|ebin|priv|src) structure. The build system has been updated accordingly.

* Renamed src/couchdb to src/couch to match the Erlang convention of the top directory name matching the Erlang application name.

* Imported the Cloudant Erlang applications for clustered CouchDB. These are imported with their history by using git subtree and merging the top-level commit. These are not external deps; development will happen within the CouchDB tree. The imported apps are:

  * config - A couch_config replacement (behavior is mostly identical to couch_config, except in how we listen for configuration changes internally, to allow for smooth hot code upgrades).
  * twig - A replacement for couch_log that emits rsyslog records.
  * rexi - An RPC library. Replaces Erlang's built-in rex application to avoid costly safety measures in the interest of performance and throughput.
  * mem3 - The "Dynamo" part of BigCouch, responsible for managing cluster state.
  * fabric - The internal cluster-aware CouchDB API.
  * ets_lru - A small library application that provides an LRU implementation using a couple of ets tables (a rough sketch of the two-table idea is appended at the end of this mail).
  * ddoc_cache - Caches design documents on each node for use in design handler functions. This uses an ets_lru cache with a very short TTL.
  * chttpd - The cluster-aware HTTP layer.

  Each imported app also had its build system updated to use Autotools, along with the application layout changes noted above for the existing CouchDB Erlang apps.

* Merged a large number of updates and fixes to couch_replicator based on work done internally at Cloudant. Unfortunately, due to an error when we created our internal clone, we lost a bit of history in the initial merge and have one big commit that mostly affects couch_replicator_manager. There are a number of other commits related to couch_replicator that resolve the single node vs. clustered differences. Some notable couch_replicator features:

  * Optionally disable checkpoints so that replication can work when a source is read-only. This should only be used for smaller databases, as each replication call then has to scan the entire source database on each invocation.
  * A new changes_pending field in the _active_tasks output.
  * A fix to continuous replication to automatically reconnect to a continuous changes feed when it sees a last_seq value. This allows the source to selectively recycle the HTTP connections used, which can be quite useful for "permanent" replications.
  * A multitude of smaller bug fixes and stability enhancements.

Updates to single node couch:

* We changed the by_seq tree to store a copy of the #full_doc_info{} record instead of the #doc_info{} record. This gives significant speed improvements for compaction, replication, and generally anything that needs to walk the by_seq tree and access document bodies internally.

* We rewrote the compactor to be significantly faster and to produce significantly better compacted databases. The two main changes are to use a temp file and to avoid btrees within that temp file. The temp file only contains a temporary copy of the document ids. At the end of a compaction run we then rebuild the by_id btree in the compaction file from this temp file.
  The reason this helps so much is that compaction walks the update_seq btree, which in most cases means the id tree would be updated in roughly random order, and that is very bad for our append-only btrees. By using the temp file we can stream the ids back into the compacted db file in order at the end of compaction, generating a minimal amount of garbage in the process. The other upgrade was to implement an external merge sort module (couch_emsort) that is used with this temporary file. (A rough sketch of this gather/sort/stream idea is appended at the end of this mail.)

* Reject updates to design docs that introduce source code that breaks compilation. Currently we only check map and reduce functions, as the others should produce user-visible errors instead of inexplicably empty views. (This one happened mostly because my OCD kicked in and I was unable to resist.)

* Reverted a change made a long time ago that uses two file descriptors for each database. See the todo list.

* The reason to remove the second fd is so that we can rewrite ref counting. Better ref counting makes everyone happy, but the real reason is the next bullet point:

* Optimized couch_server to not require a round-trip message pass when opening a database that's in the LRU. This is a significant performance boost for highly concurrent access. We also optimized couch_server internals to not blow up under load.

* Introduced a #leaf{} record into the revision trees. This is never written to disk but makes internal code a lot cleaner when dealing with multiple versions of rev tree values.

* Some changes to couch_changes to enable clustered access, plus some general cleanup.

* Internal changes to how CouchDB is booted in Erlang land. Not very sexy, but this removes a lot of complicated un-Erlangy bits. We still have a bit of work left here.

* Btree chunk sizes are now configurable, which lets people adjust the RAM/speed tradeoff a bit more.

* We now load update validation functions on the first write. This is a cluster-motivated change, because the clustered version of this call is expensive and can lead to race conditions when opening a bunch of db shards simultaneously. It should be invisible to external clients.

* Disabled conflict detection for local docs. They don't replicate, so there's no point; it just led to clusters getting stuck and confused when there were lots of replications happening.

* Changes to the multipart/mime parsing code. Necessary for clustered attachment uploads to split the incoming data stream into N copies.

* Don't use init:restart/0 when reloading the ICU driver. I think this has a bug, but we should rewrite this driver as a NIF anyway.

* New couch OS process manager. Significantly faster access to OS processes under heavy load. This replaces the hard limit with a soft limit: processes spawned over the soft limit will be used until they've sat idle for a few minutes and then be closed. We have a todo item to add the hard ceiling back in (while keeping the soft ceiling).

* Automatically replace some easily identifiable JS reduce functions with their builtin counterparts. This uses a regex to do the detection, so it's not too smart (a sketch of the approach is appended at the end of this mail).

* Improved view updater write batching.

* Updates to couchjs's views.js to improve index update speeds.

* Updates to the _stats builtin reduce to allow reduces to work over emitted stats objects. Sometimes clients have summary data in a doc, and this allows them to combine stats as long as they follow the same pattern the builtin expects.

* Added a config:reload() that is accessible by POSTing to _config/_reload. It is used by the JS tests to reset the config to what's on disk.
  This should prevent those test run failures where one test fails, leaves the config in a bad state, and causes all subsequent tests to fail. I think. Maybe.

* Databases are deleted synchronously in the test suite. We may need to address this on Windows, but it does seem to reduce the number of "{error, file_exists}" failures.

* I reimplemented the JS restartServer() function. There's a new _restart/token URL that gives a unique value for each instance of the Erlang VM. To run a restart we grab the current token value, hit _restart, then wait until we get a successful response with a different token. This appears to have made the restart strategy more robust. (A sketch of this wait loop is appended at the end of this mail.)

Things that need doing

* IP Clearance - We'll need to track down whether we have the CCLA, as well as look at each source file added to make sure each one is strictly from Cloudant or has an amenable license. I'm pretty sure that the only one of interest is trunc_io.erl, but we need to be thorough.

* Documentation - There shouldn't be much here, since the entire point of this merge was to not change the visible behavior of single node couch. A few things to add about the testing endpoints, and maybe an update to the compaction section to mention the two new file names used.

* Copyright notices - We need to strip copyright notices out of individual files and make sure all files have a standard Apache License v2 header.

* Clustered vhosts - We've never implemented this at Cloudant. We either need to write a clustered version or go back and tell people to use HAProxy (or similar) for such things.

* twig - We need to add another output type to twig that is configurable in some manner. Right now we spit out entire rsyslog records, which isn't useful for most people. We'll need to implement the file writer from couch_log as well as update the _log HTTP handler to know when it can and can't expect to find data on disk.

* fabric - This is going to need a lot of work. Specifically, view access needs to be updated to work with couch_mrview and friends.

* Boot a dev cluster - Once we fix up the clustering code we'll need to write instructions and scripts for bringing up a dev cluster.

* OTP stuff - We've updated each app, but we still need to pull some parts out of couchdb into their own applications. Specifically, the HTTP layer needs its own app. We could probably pull out the OS process/query_servers code as well as the OS daemons and friends. Once that's done we need to update the supervision trees so we don't have things like couch starting and managing the replication manager process.

* ddoc_cache - Wire this up in couch_httpd_db so that it's actually used. Right now it's only used in chttpd.

* couch_file upgrade - The revert that removed the second updater_fd from each #db{} record means that we're back in the original position of files appearing to slow down significantly under load. Since the initial hammer approach of just adding a second fd, we've discovered that the underlying bug is due to the way message passing interacts with Erlang's file I/O. Significantly, though, the fix is rather simple to implement. A first draft of this work is on an old branch of mine here: https://github.com/davisp/couchdb/commit/d856878

* Finish the size calculation changes - The #leaf{} record change is there to enable us to add more data size calculations. CouchDB master calculates a data size that accounts for all bytes that are active in a .couch file. Cloudant is interested in the total size of uncompressed docs and attachments minus the internal overhead of btrees.
  And there's a fourth number to calculate based on the compression level used. Having each of these numbers will be useful, as will the calculations they enable (i.e., dead bytes in the file, bytes used for overhead, compression ratio achieved, etc.).

* couch_proc_manager - We need to implement the hard ceiling for capping the number of OS processes. We've started seeing a need for this at Cloudant with some workloads, so motivation to fix it is high. The only failing etap test is the assertion of this ceiling.

* Synchronous db delete on Windows - I did this because running the test suite was driving me bonkers. I need to ask Dave how this behaves on Windows (my guess is not well), but I think we can close things up so that it works better than the status quo.
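Appendix: illustrative sketches

A few items above point at sketches appended here. To be clear, none of this is the actual CouchDB or Cloudant code: the module names, function names, patterns, and constants below are invented, and everything is heavily simplified. The sketches are only meant to make the prose descriptions above concrete.

First, the compactor. The bullet above describes gathering document ids into a temp file while walking the by_seq tree, sorting them externally (couch_emsort), and then streaming them back in id order so the rebuilt by_id btree is written sequentially with minimal garbage. The toy module below shows that gather/sort/stream shape with a fixed run size and plain sorted lists; it is not couch_emsort and contains no btree code.

    %% Toy external sort in the spirit of the compactor change: spill
    %% {Id, Pointer} pairs (seen in update_seq order) into sorted runs on
    %% disk, then merge the runs so the ids come back in id order.
    -module(emsort_sketch).
    -export([sort_to_disk/2, merge_runs/1]).

    -define(RUN_SIZE, 1000).

    %% Split the pairs into runs, sort each run, and write it to its own file.
    sort_to_disk(Pairs, Dir) ->
        Runs = chunk(Pairs, ?RUN_SIZE, []),
        lists:map(
            fun({N, Run}) ->
                Path = filename:join(Dir, "run." ++ integer_to_list(N)),
                ok = file:write_file(Path, term_to_binary(lists:sort(Run))),
                Path
            end,
            lists:zip(lists:seq(1, length(Runs)), Runs)).

    %% Merge the sorted runs back into a single id-ordered list. A real
    %% external merge sort streams the runs instead of loading them whole.
    merge_runs(Paths) ->
        SortedRuns = [begin
                          {ok, Bin} = file:read_file(Path),
                          binary_to_term(Bin)
                      end || Path <- Paths],
        lists:merge(SortedRuns).

    chunk([], _N, Acc) ->
        lists:reverse(Acc);
    chunk(Pairs, N, Acc) ->
        {Run, Rest} = case length(Pairs) > N of
            true  -> lists:split(N, Pairs);
            false -> {Pairs, []}
        end,
        chunk(Rest, N, [Run | Acc]).

The point is the access pattern: the merged output can be folded in id order and appended to the new by_id btree in one pass, instead of inserting ids in effectively random order.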
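Next, ets_lru. The description above is "an LRU implementation using a couple of ets tables". A rough sketch of how the two-table idea can work: a set table maps keys to values plus an access stamp, and an ordered_set table maps stamps back to keys, so the least recently used entry is always ets:first/1 of the second table. This is not the real ets_lru API (which is a gen_server and also supports the TTLs that ddoc_cache relies on); the names and record below are invented.

    %% Minimal two-table LRU sketch. Not the ets_lru API.
    -module(lru_sketch).
    -export([new/1, put/3, get/2]).

    -record(lru, {objects, atimes, max_size}).

    new(MaxSize) ->
        #lru{objects  = ets:new(objects, [set]),
             atimes   = ets:new(atimes, [ordered_set]),
             max_size = MaxSize}.

    put(#lru{objects = Objs, atimes = ATimes, max_size = Max} = Lru, Key, Val) ->
        remove(Lru, Key),
        %% unique_integer/1 (OTP 18+) is just a monotonically increasing stamp.
        Stamp = erlang:unique_integer([monotonic]),
        ets:insert(Objs, {Key, Val, Stamp}),
        ets:insert(ATimes, {Stamp, Key}),
        case ets:info(Objs, size) > Max of
            true ->
                %% Evict the least recently used entry: the smallest stamp.
                OldestStamp = ets:first(ATimes),
                [{OldestStamp, OldKey}] = ets:lookup(ATimes, OldestStamp),
                remove(Lru, OldKey);
            false ->
                ok
        end,
        Lru.

    get(#lru{objects = Objs, atimes = ATimes}, Key) ->
        case ets:lookup(Objs, Key) of
            [{Key, Val, OldStamp}] ->
                %% Touch the entry so it becomes the most recently used.
                NewStamp = erlang:unique_integer([monotonic]),
                ets:delete(ATimes, OldStamp),
                ets:insert(ATimes, {NewStamp, Key}),
                ets:update_element(Objs, Key, {3, NewStamp}),
                {ok, Val};
            [] ->
                not_found
        end.

    remove(#lru{objects = Objs, atimes = ATimes}, Key) ->
        case ets:lookup(Objs, Key) of
            [{Key, _Val, Stamp}] ->
                ets:delete(ATimes, Stamp),
                ets:delete(Objs, Key);
            [] ->
                ok
        end.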
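Third, the builtin reduce detection. The item above says we automatically swap easily identifiable JS reduce functions for their builtin counterparts using a regex. The patterns below are made up for illustration (the real detection code and patterns in the tree differ), but the shape is the same: try a short list of regexes against the reduce source and fall back to running the user's function.

    %% Illustrative regex-based detection of trivial JS reduce functions.
    -module(reduce_rewrite_sketch).
    -export([maybe_builtin/1]).

    %% Returns {builtin, Name} when the JS source is a trivial wrapper
    %% around sum(values) or values.length, otherwise {custom, Source}.
    maybe_builtin(Source) ->
        Patterns = [
            {<<"_sum">>,
             "^\\s*function\\s*\\([^)]*\\)\\s*\\{\\s*return\\s+sum\\(values\\);?\\s*\\}\\s*$"},
            {<<"_count">>,
             "^\\s*function\\s*\\([^)]*\\)\\s*\\{\\s*return\\s+values\\.length;?\\s*\\}\\s*$"}
        ],
        check(Source, Patterns).

    check(Source, []) ->
        {custom, Source};
    check(Source, [{Builtin, Pattern} | Rest]) ->
        case re:run(Source, Pattern, [{capture, none}]) of
            match   -> {builtin, Builtin};
            nomatch -> check(Source, Rest)
        end.

For example, maybe_builtin(<<"function(keys, values) { return sum(values); }">>) comes back as {builtin, <<"_sum">>}, while anything the patterns don't recognize is left alone, which matches the "not too smart" caveat above.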
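Finally, the restartServer() change. The real helper lives in the JS test suite; the sketch below re-expresses the same wait loop in Erlang using httpc, with the endpoint paths taken from the description above (GET _restart/token, POST _restart) and the rest invented: read the current token, trigger the restart, then poll until a request succeeds and returns a different token.

    %% Sketch of the restart-and-wait protocol described above.
    -module(restart_wait_sketch).
    -export([restart_and_wait/1]).

    restart_and_wait(BaseUrl) ->
        {ok, _} = application:ensure_all_started(inets),
        {ok, OldToken} = get_token(BaseUrl),
        %% Trigger the restart; this request may fail mid-flight, which is fine.
        _ = (catch httpc:request(post, {BaseUrl ++ "/_restart", [],
                                        "application/json", ""}, [], [])),
        wait_for_new_token(BaseUrl, OldToken, 100).

    wait_for_new_token(_BaseUrl, _OldToken, 0) ->
        {error, timeout};
    wait_for_new_token(BaseUrl, OldToken, Retries) ->
        case get_token(BaseUrl) of
            {ok, Token} when Token =/= OldToken ->
                {ok, Token};
            _ ->
                timer:sleep(200),
                wait_for_new_token(BaseUrl, OldToken, Retries - 1)
        end.

    get_token(BaseUrl) ->
        case httpc:request(get, {BaseUrl ++ "/_restart/token", []}, [], []) of
            {ok, {{_, 200, _}, _Headers, Body}} -> {ok, Body};
            _ -> error
        end.

The important property, as noted above, is that we wait for a *different* token rather than just any successful response, so we can't declare success against the old VM instance before it has actually gone away.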