Great, but wasn't David asking about POST, not PUT? *ducks* Seriously, though, great post. This will really help with learning CouchDB and Erlang. Thanks!
~Ryan

On Thu, Nov 1, 2012 at 5:47 AM, Garren Smith <[email protected]> wrote:

> This is a brilliant explanation, thanks Jan. Nice way to learn how
> everything fits together.
>
> Cheers
> Garren
>
> On 01 Nov 2012, at 1:22 PM, Jan Lehnardt <[email protected]> wrote:
>
> > Heya David,
> >
> > On Nov 1, 2012, at 08:39 , 高大为 <[email protected]> wrote:
> >
> >> Hi Erlang/CouchDB,
> >>
> >> Recently I have been reading the source code of CouchDB, and I have
> >> learned a bit about how CouchDB boots up.
> >>
> >> Now I want to learn what happens when I send a request, for example
> >> POST a document: which parts of the CouchDB code are involved, from
> >> handling the request to saving the data to disk / filesystem?
> >>
> >> In other words, POST doc ===> save data to disk / filesystem: which
> >> parts of the code do the work for the whole procedure?
> >>
> >> Regards & Thanks!
> >> David
> >
> >
> > I’ve been waiting for an excuse to do a deep-dive like this, thanks! :)
> >
> > Given that you already dug around some yourself, I omit the “how to get
> > to the code” section. I am on current master 1a9143e.
> >
> > Let’s start at the HTTP API: src/couchdb/couch_httpd.erl
> >
> > `couch_httpd` is the main entry point for all request handling in
> > CouchDB. Its responsibilities are:
> >
> > - Read the CouchDB configuration to configure itself with all
> >   settings a user wishes to have for handling requests.
> > - Set up a socket to listen on for incoming requests.
> > - Set up a list of request handlers that map API actions to
> >   internal module calls that do actual work.
> > - Start Mochiweb to handle everything related to HTTP.
> > - Export a number of functions that the request handler sub
> >   modules can use to handle requests.
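
To make the "list of request handlers" idea from the list above concrete, here is a minimal sketch of a table that maps a path to a `{Module, Function}` pair and then calls it. This is not CouchDB's code: the module name `handler_sketch`, the functions `welcome/1` and `not_found/1`, and the shape of the table are all made up for illustration.

```erlang
-module(handler_sketch).
-export([dispatch/2, welcome/1, not_found/1]).

%% Look up the handler for Path in Handlers, a list of
%% {Path, {Module, Function}} tuples, and call it with Path.
%% Paths without an entry fall back to not_found/1.
dispatch(Handlers, Path) ->
    {Mod, Fun} = proplists:get_value(Path, Handlers, {?MODULE, not_found}),
    Mod:Fun(Path).

%% Hypothetical handler for the root URL.
welcome(_Path) ->
    {200, <<"Welcome">>}.

%% Hypothetical fallback handler.
not_found(Path) ->
    {404, Path}.
```

Calling `handler_sketch:dispatch([{<<"/">>, {handler_sketch, welcome}}], <<"/">>)` runs `welcome/1`; any other path ends up in `not_found/1`. The point is only that a handler can be stored as plain data and called later with `Mod:Fun(Args)`.
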
> >
> > The sub-modules are all in src/couchdb/:
> >
> > - couch_httpd.erl
> > - couch_httpd_auth.erl
> > - couch_httpd_db.erl
> > - couch_httpd_external.erl
> > - couch_httpd_misc_handlers.erl
> > - couch_httpd_oauth.erl
> > - couch_httpd_proxy.erl
> > - couch_httpd_rewrite.erl
> > - couch_httpd_stats_handlers.erl
> > - couch_httpd_vhost.erl
> >
> > The mapping of request handlers to URLs happens in the CouchDB
> > configuration. The defaults are set in etc/couchdb/default.ini,
> > which in source form is called etc/couchdb/default.ini.tpl.in,
> > meaning that there are two layers of replacing variables going
> > on until we get a final default.ini. For the request handlers,
> > we can look at default.ini.tpl.in.
> >
> > The mapping of URLs to request handlers happens on three layers:
> >
> > - Global handlers for things like `/`, `/_utils`, `/_config` etc.
> > - Database handlers like `/db/_all_docs` or `/db/_compact`.
> > - Design document handlers like `/db/_design/docid/_view`
> >
> >
> > With this knowledge, let’s trace this HTTP request:
> >
> >     PUT /db/docid
> >     ...
> >     {"a":1}
> >
> > Or in `curl`:
> >
> >     $ curl -X PUT http://127.0.0.1:5984/db/docid -d '{"a":1}'
> >
> >
> > The request goes to a `/db` URL, so we’ll have a look at the
> > `[httpd_db_handlers]` section of default.ini.tpl.in:
> >
> >
> >     [httpd_db_handlers]
> >     _all_docs = {couch_mrview_http, handle_all_docs_req}
> >     _changes = {couch_httpd_db, handle_changes_req}
> >     _compact = {couch_httpd_db, handle_compact_req}
> >     _design = {couch_httpd_db, handle_design_req}
> >     _temp_view = {couch_mrview_http, handle_temp_view_req}
> >     _view_cleanup = {couch_mrview_http, handle_cleanup_req}
> >
> > Hm, nothing that looks like a handler for creating documents.
> >
> > Let’s go back to couch_httpd.erl. In line 138 we see that we
> > start Mochiweb with a list of handlers, first of all the
> > `DefaultFun`, maybe we need to look at that. We are tracking
> > it back to line 102.
> > There’s a bit of gibberish about “arity”, we’ll ignore that for now.
> > Then we see that we *do* rely on the config system:
> >
> >     couch_config:get("httpd", "default_handler"…).
> >
> > So let’s look at the `[httpd]` section of default.ini.tpl.in:
> >
> >     default_handler = {couch_httpd_db, handle_request}
> >
> > That looks promising, let’s find that in code, at
> > src/couchdb/couch_httpd_db.erl, line 36.
> >
> > `handle_request()` first checks whether we want to create or
> > delete a database, but when it sees we don’t, it passes our
> > request along to `do_db_req()` (line 230), which turns out
> > just to be a wrapper that opens a database and calls a callback,
> > so back to where `do_db_req()` is called, we see `db_req/2` is
> > passed as a callback.
> >
> > Now `db_req()` has various clauses to differentiate the different
> > HTTP request methods it is called with and to allow for all sorts
> > of special URLs to be called. We are interested in PUT, but we
> > don’t find that PUT is handled anywhere in particular. We do see,
> > however, that all the clauses before the last-but-one handle
> > something that is *not* PUT, so we know that our clause is on
> > line 464:
> >
> >     db_req(#httpd{path_parts=[_, DocId]}=Req, Db) ->
> >         db_doc_req(Req, Db, DocId);
> >
> > Which turns out to be yet another indirection, so let’s go with it.
> > `db_doc_req` again has a number of clauses to deal with various
> > request types. Lucky for us, there is a PUT clause on line 563:
> >
> >     db_doc_req(#httpd{method='PUT'}=Req, Db, DocId) ->
> >
> > First, the function checks whether we have a valid `DocId`.
> > Assuming we do, it checks whether the request is an HTTP multipart
> > request or a regular one.
> > We have a regular one and are lucky again, our part of the code
> > here is rather small:
> >
> >     Body = couch_httpd:json_body(Req),
> >     Doc = couch_doc_from_req(Req, DocId, Body),
> >     update_doc(Req, Db, DocId, Doc)
> >
> > The first line fetches the JSON document body from the `Req`
> > variable. At this point, `Body` should equal: `<<"{\"a\":1}">>`,
> > an Erlang binary that encodes the JSON body we passed in as
> > a request.
> >
> > The second line turns the JSON, together with the `DocId`, into
> > a CouchDB document.
> >
> > Finally, we pass all we have now to the `update_doc` function, which
> > we check out later.
> >
> > `couch_doc_from_req()` figures out whether we are trying to update
> > an existing doc with our PUT request, or whether we want to create
> > a new one. In our case, not much is done; in the update case, we
> > need to pass in a `rev=` query parameter and that is checked here.
> >
> > In either case though, this function returns a value of the type
> > `#doc{}`, which is a record that is defined in src/couchdb/couch_db.hrl,
> > line 99, if you are curious.
> >
> > With all that in place, we can finally visit `update_doc()`. It again
> > has a few clauses starting in line 716 (we are still in
> > couch_httpd_db.erl).
> >
> > `update_doc` deals with a number of query parameters again until it
> > finally calls `couch_db:update_doc()`.
> >
> > This is our entry into the innards of CouchDB.
> >
> > Enter `couch_db` in src/couchdb/couch_db.erl. Our function
> > `update_doc()` is defined in line 422, and it ultimately seems to
> > be a wrapper around `update_docs()` (plural) in the lines starting
> > at 688. `update_docs()` has two independent clauses:
> >
> >     update_docs(Db, Docs, Options, replicated_changes) ->
> >
> > and
> >
> >     update_docs(Db, Docs, Options, interactive_edit) ->
> >
> > The first one handles replications that can create conflicts in
> > document revision lists. The second one deals with regular
> > database operations. So that one is for us.
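
For readers new to Erlang, a minimal sketch of how such clauses are picked: the fourth argument is an atom, and Erlang runs the first clause whose pattern matches it. The module name `clause_sketch` and the return values are invented for illustration; the real `update_docs/4` clauses do far more than this.

```erlang
-module(clause_sketch).
-export([update_docs/4]).

%% Stand-in for the replication clause: selected when the fourth
%% argument is the atom replicated_changes.
update_docs(_Db, Docs, _Options, replicated_changes) ->
    {replicated_changes, length(Docs)};
%% Stand-in for the regular clause: selected when the fourth
%% argument is the atom interactive_edit.
update_docs(_Db, Docs, _Options, interactive_edit) ->
    {interactive_edit, length(Docs)}.
```

A call with any other atom in the fourth position fails with a `function_clause` error, because no clause matches.
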
> >
> > Our `update_docs()` does a number of things:
> >
> > - prepare for yet more request parameters.
> > - separate our `_local` docs and regular docs (ours is a regular one).
> > - validate our document against `validate_doc_update` functions, if
> >   they exist.
> > - check whether we provided the correct `rev` in case of updates.
> >
> > And finally:
> >
> >     {ok, CommitResults} = write_and_commit(Db, DocBuckets4, NonRepDocs, Options2),
> >
> > Let’s jump there, line 831:
> >
> > After doing some more preparations that I will gloss over, we see
> > that CouchDB keeps around an `UpdatePid` in the `#db{}` record that
> > has been passed down with us so far. This `UpdatePid` is the process
> > ID of a process that deals with database updates.
> >
> > In CouchDB, each database has a single process handling writes to the
> > database, to ensure a consistent database file.
> >
> > In `write_and_commit()` we send the `update_docs` message to that
> > process (in line 839):
> >
> >     UpdatePid ! {update_docs, self(), DocBuckets, NonRepDocs, MergeConflicts, FullCommit},
> >
> > Let’s see where that message is handled.
> >
> > We need to know that the process behind `UpdatePid` runs the
> > `couch_db_updater` module. We would have found that out in
> > `couch_db:init()`.
> >
> > The `update_docs` message is handled in
> > src/couchdb/couch_db_updater.erl in line 223.
> >
> > Once the whole message has been received, the list of docs (in our
> > case, a list with just our document) is sent to `update_docs_int()`
> > (line 672).
> >
> > `update_docs_int()` handles access to CouchDB’s main database data
> > structure, the B+-tree. In fact, there are two B+-trees in each
> > database at the same time: the fulldocinfo_by_id_btree and the
> > docinfo_by_seq_btree. The first one contains all document data
> > indexed by document id. The second one includes pointers to the
> > fulldocinfo btree indexed by update sequence.
> > The by_seq btree is what drives CouchDB’s /_changes feature, which in
> > turn powers replication, compaction and view creation.
> >
> > A new document is inserted in both indexes in lines 705 and 706:
> >
> >     {ok, DocInfoByIdBTree2} = couch_btree:add_remove(DocInfoByIdBTree, IndexFullDocInfos, []),
> >     {ok, DocInfoBySeqBTree2} = couch_btree:add_remove(DocInfoBySeqBTree, IndexDocInfos, RemoveSeqs),
> >
> > At this point, our doc lives in the database structure and has been
> > assigned a new `rev`, but it has not yet been written to disk. The
> > last operation in `update_docs_int()` is `commit_data()`, which
> > sounds promising. Let’s jump down.
> >
> > The definition starts in line 781, the relevant bit for us in line 785.
> > CouchDB writes changes to disk in this fashion:
> >
> > 1. write all changes to the data and index trees to the disk.
> > 3. write a header to disk that has the current pointers to the index
> >    trees that we wrote in 1.
> >
> > Writing to disk does not yet mean that the data actually arrived on
> > disk. It might, but we only know for sure after we call the `fsync`
> > system call. From Erlang, we call `couch_file:sync()`.
> >
> > Now there are different classes of behaviour possible in the list above.
> > Notice how I left out 2.
> >
> > Writing a CouchDB file (which can be either a database file or a view
> > index) can give different storage guarantees. The options are to fsync
> > before the header is written, or after, or both. An fsync is a
> > potentially expensive operation, so we have fine-grained control over
> > this here.
> >
> > The full list is:
> >
> > 1. write all changes to the data and index trees to the disk.
> > 2. fsync.
> > 3. write a header to disk that has the current pointers to the index
> >    trees that we wrote in 1.
> > 4. fsync.
> >
> > 2.-4. happen in `commit_data()`, but wait, where did 1. happen?
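
The four steps in the full list can be sketched with nothing but the standard `file` module. This is a toy under stated assumptions, not CouchDB's `commit_data()`: the module name `commit_sketch`, the function `commit/4`, and the option atoms `before_header` and `after_header` are names chosen for this example, and the "data" and "header" are just binaries appended to a file.

```erlang
-module(commit_sketch).
-export([commit/4]).

%% Append Data, optionally fsync, append Header, optionally fsync.
%% SyncOpts is a list that may contain before_header and/or after_header.
commit(Path, Data, Header, SyncOpts) ->
    {ok, Fd} = file:open(Path, [append, raw, binary]),
    ok = file:write(Fd, Data),                      % 1. write the changes
    ok = maybe_sync(Fd, before_header, SyncOpts),   % 2. fsync (optional)
    ok = file:write(Fd, Header),                    % 3. write the header
    ok = maybe_sync(Fd, after_header, SyncOpts),    % 4. fsync (optional)
    file:close(Fd).

%% Call file:sync/1 only if Opt was requested.
maybe_sync(Fd, Opt, SyncOpts) ->
    case lists:member(Opt, SyncOpts) of
        true -> file:sync(Fd);
        false -> ok
    end.
```

Passing `[before_header, after_header]` gives the strongest guarantee at the cost of two fsyncs per commit; passing `[]` leaves the timing of the physical write entirely to the operating system.
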
> >
> > For that, we need to jump back to `update_docs_int()`, line 697:
> >
> >     % Write out the document summaries (the bodies are stored in the nodes of
> >     % the trees, the attachments are already written to disk)
> >     {ok, FlushedFullDocInfos} = flush_trees(Db2, NewFullDocInfos, []),
> >
> > `flush_trees()` is defined in line 519. It iterates over the new data
> > in the database and recursively writes it to disk in line 547:
> >
> >     {ok, NewSummaryPointer, SummarySize} =
> >         couch_file:append_raw_chunk(Fd, Summary),
> >
> > Finally, we drop into `couch_file`, the lowest level of CouchDB.
> > `append_raw_chunk()` is defined in line 111 and it is just a small
> > wrapper that sends the `append_bin` message to the process that
> > manages the file descriptor for our database file.
> >
> > `append_bin` is handled in line 373. It takes the data to be
> > written and pads it out to make it a multiple of `?SIZE_BLOCK`
> > (which is 4096 bytes).
> >
> > In line 376 our data is finally written to disk:
> >
> >     file:write(Fd, Blocks)
> >
> > From here on out we go back up into `couch_db_updater` and
> > deal with the header business we looked at earlier. From there
> > it jumps back up into `couch_db`, which waits for a success in
> > writing the data, and when that shows up, it hands it back to
> > `couch_httpd_db`, which uses `couch_httpd` to report the successful
> > writing of the document as an HTTP response.
> >
> > This concludes our little tour.
> >
> > I hope this was helpful! Let us know if there are any questions.
> >
> > Jan
> > --
> >
>
>

--
http://twitter.com/rmgraham
