Re: code to handle the procedure - between POST doc and save data in filesystem

Ryan Graham Thu, 01 Nov 2012 09:58:38 -0700

Great, but wasn't David asking about POST, not PUT? *ducks*

Seriously, though, great post. This will really help with learning CouchDB
and Erlang. Thanks!


~Ryan


On Thu, Nov 1, 2012 at 5:47 AM, Garren Smith <[email protected]> wrote:

> This is a brilliant explanation, Thanks Jan. Nice way to learn how
> everything fits together.
>
> Cheers
> Garren
>
> On 01 Nov 2012, at 1:22 PM, Jan Lehnardt <[email protected]> wrote:
>
> > Heya David,
> >
> > On Nov 1, 2012, at 08:39 , 高大为 <[email protected]> wrote:
> >
> >> Hi Erlang/CouchDB,
> >>
> >> Recently I have been reading the CouchDB source code, and I now have
> >> some understanding of how CouchDB boots up.
> >>
> >> Right now I want to learn what happens when a request is sent, for
> >> example a POST of a document: which parts of the CouchDB code are
> >> involved, from handling the request to saving the data to disk /
> >> filesystem?
> >>
> >> In other words, POST doc ===> save data to disk / filesystem: which
> >> parts of the code do the work for the whole procedure?
> >>
> >> Regards & Thanks!
> >> David
> >
> >
> > I’ve been waiting for an excuse to do a deep-dive like this, thanks! :)
> >
> > Given that you already dug around some yourself, I’ll omit the “how to get
> > to the code” section. I am on current master 1a9143e.
> >
> > Let’s start at the HTTP API: src/couchdb/couch_httpd.erl
> >
> > `couch_httpd` is the main entry point for all request handling in
> > CouchDB. Its responsibilities are:
> >
> > - Read the CouchDB configuration to configure itself with all
> >   settings a user wishes to have for handling requests.
> > - Set up a socket to listen on for incoming requests.
> > - Set up a list of request handlers that map API actions to
> >   internal module calls that do actual work.
> > - Start Mochiweb to handle everything related to HTTP.
> > - Export a number of functions that the request handler sub
> >   modules can use to handle requests.
> >
> > The sub-modules are all in src/couchdb/:
> >
> > - couch_httpd.erl
> > - couch_httpd_auth.erl
> > - couch_httpd_db.erl
> > - couch_httpd_external.erl
> > - couch_httpd_misc_handlers.erl
> > - couch_httpd_oauth.erl
> > - couch_httpd_proxy.erl
> > - couch_httpd_rewrite.erl
> > - couch_httpd_stats_handlers.erl
> > - couch_httpd_vhost.erl
> >
> > The mapping of request handlers to URLs happens in the CouchDB
> > configuration. The defaults are set in etc/couchdb/default.ini,
> > which in source form is called etc/couchdb/default.ini.tpl.in,
> > meaning that there are two layers of replacing variables going
> > on until we get a final default.ini. For the request handlers,
> > we can look at default.ini.tpl.in.
> >
> > The mapping of URLs to request handlers happens on three layers:
> >
> > - Global handlers for things like `/`, `/_utils`, `/_config` etc.
> > - Database handlers like `/db/_all_docs` or `/db/_compact`.
> > - Design document handlers like `/db/_design/docid/_view`
> >
> >
> > With this knowledge, let’s trace this HTTP request:
> >
> >    PUT /db/docid
> >    ...
> >    {"a":1}
> >
> > Or in `curl`:
> >
> >    $ curl -X PUT http://127.0.0.1:5984/db/docid -d '{"a":1}'
> >
> >
> > The request goes to a `/db` URL, so we’ll have a look at the
> > `[httpd_db_handlers]` section of default.ini.tpl.in:
> >
> >
> >    [httpd_db_handlers]
> >    _all_docs = {couch_mrview_http, handle_all_docs_req}
> >    _changes = {couch_httpd_db, handle_changes_req}
> >    _compact = {couch_httpd_db, handle_compact_req}
> >    _design = {couch_httpd_db, handle_design_req}
> >    _temp_view = {couch_mrview_http, handle_temp_view_req}
> >    _view_cleanup = {couch_mrview_http, handle_cleanup_req}
> >
> > Hm, nothing that looks like a handler for creating documents.
> >
> > Let’s go back to couch_httpd.erl. In line 138 we see that we
> > start Mochiweb with a list of handlers, first of all the
> > `DefaultFun`, maybe we need to look at that. We are tracking
> > it back to line 102. There’s a bit of gibberish about “arity”,
> > we’ll ignore that for now. Then we see that we *do* rely on
> > the config system:
> >
> >    couch_config:get("httpd", "default_handler"…).
> >
> > So let’s look at the `[httpd]` section of default.ini.tpl.in:
> >
> >    default_handler = {couch_httpd_db, handle_request}
> >
> > That looks promising, let’s find that in code, at
> > src/couchdb/couch_httpd_db.erl, line 36.
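> >
> > To make the config-to-code step concrete, here is a minimal sketch of how
> > a `{Module, Function}` spec read from the ini file could be turned into a
> > callable fun. It is not the code in couch_httpd.erl: the module name
> > `handler_sketch` and the functions `make_handler/1` and `demo/0` are made
> > up for illustration, and `{erlang, length}` merely stands in for a real
> > request handler.
> >
> > ```erlang
> > -module(handler_sketch).
> > -export([make_handler/1, demo/0]).
> >
> > %% Parse a "{Module, Function}" string, as it appears in the ini file,
> > %% into an Erlang term and wrap it in a fun that takes one argument.
> > make_handler(SpecStr) ->
> >     %% parse_term/1 expects the token list to end with a dot.
> >     {ok, Tokens, _End} = erl_scan:string(SpecStr ++ "."),
> >     {ok, {Mod, Fun}} = erl_parse:parse_term(Tokens),
> >     fun(Req) -> Mod:Fun(Req) end.
> >
> > %% Stand-in for dispatching a request: calls erlang:length/1 on a list.
> > demo() ->
> >     Handler = make_handler("{erlang, length}"),
> >     Handler([a, b, c]).
> > ```
> >
> > Calling `handler_sketch:demo()` goes through the same indirection a request
> > would: the string from the config decides which module and function run.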
> >
> > `handle_request()` first checks whether we want to create or
> > delete a database, but when it sees we don’t, it passes our
> > request along to `do_db_req()` (line 230), which turns out
> > just to be a wrapper that opens a database and calls a callback,
> > so back to where `do_db_req()` is called, we see `db_req/2` is
> > passed as a callback.
> >
> > Now `db_req()` has various clauses to differentiate the different
> > HTTP request methods it is called with and to allow for all sorts
> > of special URLs to be called. We are interested in PUT, but we
> > don’t find that PUT is handled anywhere in particular. We do see,
> > however, that all the clauses before the last-but-one handle
> > something that is *not* PUT, so we know that our clause is on
> > line 464:
> >
> >    db_req(#httpd{path_parts=[_, DocId]}=Req, Db) ->
> >        db_doc_req(Req, Db, DocId);
> >
> > Which turns out to be yet another indirection, so let’s go with it.
> > `db_doc_req` again has a number of clauses to deal with various
> > request types. Lucky for us, there is a PUT clause on line 563:
> >
> >    db_doc_req(#httpd{method='PUT'}=Req, Db, DocId) ->
> >
> > First, the function checks whether we have a valid `DocId`.
> > Assuming we do, it checks whether the request is an HTTP multipart
> > request or a regular one. We have a regular one and are lucky
> > again: our part of the code here is rather small:
> >
> >    Body = couch_httpd:json_body(Req),
> >    Doc = couch_doc_from_req(Req, DocId, Body),
> >    update_doc(Req, Db, DocId, Doc)
> >
> > The first line fetches the JSON document body from the `Req`
> > variable. At this point, `Body` should equal `<<"{\"a\":1}">>`,
> > an Erlang binary that encodes the JSON body we passed in with
> > the request.
> >
> > The second line turns the JSON, together with the `DocId` into
> > a CouchDB document.
> >
> > Finally, we pass all we have now to the `update_doc` function, which
> > we check out later.
> >
> > `couch_doc_from_req()` figures out whether we are trying to update
> > an existing doc with our PUT request, or whether we want to create
> > a new one. In our case, not much is done; in the update case, we
> > need to pass in a `rev=` query parameter, and that is checked here.
> >
> > In either case though, this function returns a value of the type
> > `#doc{}`, which is a record that is defined in src/couchdb/couch_db.hrl,
> > line 99, if you are curious.
> >
> > With all that in place, we can finally visit `update_doc()`. It again
> > has a few clauses starting in line 716 (we are still in
> > couch_httpd_db.erl).
> >
> > `update_doc` deals with a number of query parameters again until it
> > finally calls `couch_db:update_doc()`.
> >
> > This is our entry into the innards of CouchDB.
> >
> > Enter `couch_db` in src/couchdb/couch_db.erl. Our function
> > `update_doc()` is defined in line 422, and it ultimately seems to
> > be a wrapper around `update_docs()` (plural) in the lines starting
> > at 688. Update docs has two independent clauses:
> >
> >    update_docs(Db, Docs, Options, replicated_changes) ->
> >
> > and
> >
> >    update_docs(Db, Docs, Options, interactive_edit) ->
> >
> > The first one handles replications that can create conflicts in
> > document revision lists. The second one deals with regular
> > database operations. So that one is for us.
> >
> > Our `update_docs()` does a number of things:
> >
> > - prepare for yet more request parameters.
> > - separate our `_local` docs and regular docs (ours is a regular one).
> > - validate our document against `validate_update_function`s, if they
> >   exist.
> > - check whether we provided the correct `rev` in case of updates.
> >
> > And finally:
> >
> >    {ok, CommitResults} = write_and_commit(Db, DocBuckets4, NonRepDocs,
> >        Options2),
> >
> > Let’s jump there, line 831:
> >
> > After doing some more preparations that I will gloss over, we see
> > that CouchDB keeps around an `UpdatePid` in the `#db{}` record that
> > has been passed along with us so far. This `UpdatePid` is the process
> > ID of a process that deals with database updates.
> >
> > In CouchDB, each database has a single process handling writes to the
> > database, to ensure a consistent database file.
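> >
> > The single-writer idea can be sketched in a few lines of plain Erlang. This
> > is only an illustration of the pattern, not `couch_db_updater`: the module
> > `writer_sketch`, its functions, and the simplified message shape
> > `{update_docs, From, Docs}` are assumptions made for the example, and the
> > "database" is just a list held in the writer's loop state.
> >
> > ```erlang
> > -module(writer_sketch).
> > -export([start/0, update_docs/2]).
> >
> > %% Start the one writer process; its state is the list of docs so far.
> > start() ->
> >     spawn(fun() -> loop([]) end).
> >
> > %% Client side: send the docs to the writer and block until it answers.
> > update_docs(WriterPid, Docs) ->
> >     WriterPid ! {update_docs, self(), Docs},
> >     receive
> >         {WriterPid, Result} -> Result
> >     end.
> >
> > %% Writer side: messages are handled one at a time, so two concurrent
> > %% callers can never interleave their writes.
> > loop(Written) ->
> >     receive
> >         {update_docs, From, Docs} ->
> >             NewWritten = Written ++ Docs,
> >             From ! {self(), {ok, length(NewWritten)}},
> >             loop(NewWritten)
> >     end.
> > ```
> >
> > Any number of request processes can call `writer_sketch:update_docs/2`; the
> > writer's mailbox serializes them, which is what keeps the state consistent.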
> >
> > In `write_and_commit()` we send the `update_docs` message to that
> > process (in line 839):
> >
> >   UpdatePid ! {update_docs, self(), DocBuckets, NonRepDocs,
> >       MergeConflicts, FullCommit},
> >
> > Let’s see where that message is handled.
> >
> > We need to know that the process behind `UpdatePid` runs the
> > `couch_db_updater` module. We would have found that out in
> > `couch_db:init()`.
> >
> > The `update_docs` message is handled in src/couchdb/couch_db_updater.erl
> > in line 223.
> >
> > After the whole message has been received, the list of docs (in our case,
> > a list with just our document) is passed to `update_docs_int()` (line 672).
> >
> > `update_docs_int()` handles access to CouchDB’s main database data
> > structure, the B+-tree. In fact, there are two B+-trees in each database
> > at the same time: the fulldocinfo_by_id_btree and the
> > docinfo_by_seq_btree. The first one contains all document data indexed by
> > document id. The second one includes pointers to the fulldocinfo btree
> > indexed by update sequence. The by_seq btree is what drives CouchDB’s
> > /_changes feature, which in turn powers replication, compaction and view
> > creation.
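> >
> > As a rough mental model of the two indexes, here is a sketch that uses
> > `gb_trees` from the Erlang standard library in place of CouchDB’s on-disk
> > B+-trees. Everything in it (the module `index_sketch`, the tuple layout,
> > the function names) is invented for illustration and says nothing about
> > how `couch_btree` stores its nodes.
> >
> > ```erlang
> > -module(index_sketch).
> > -export([new/0, save/3, changes_since/2]).
> >
> > %% State: {ById, BySeq, LastSeq}
> > %%   ById  maps doc id          -> {Seq, Body}
> > %%   BySeq maps update sequence -> doc id
> > new() ->
> >     {gb_trees:empty(), gb_trees:empty(), 0}.
> >
> > %% Insert or update a doc: it gets the next sequence number, and any
> > %% older by-seq entry for the same id is removed.
> > save(Id, Body, {ById, BySeq, LastSeq}) ->
> >     Seq = LastSeq + 1,
> >     BySeq1 = case gb_trees:lookup(Id, ById) of
> >         {value, {OldSeq, _OldBody}} -> gb_trees:delete(OldSeq, BySeq);
> >         none -> BySeq
> >     end,
> >     {gb_trees:enter(Id, {Seq, Body}, ById),
> >      gb_trees:enter(Seq, Id, BySeq1),
> >      Seq}.
> >
> > %% A toy "changes feed": every {Seq, Id} pair newer than Since, in order.
> > changes_since(Since, {_ById, BySeq, _LastSeq}) ->
> >     [{Seq, Id} || {Seq, Id} <- gb_trees:to_list(BySeq), Seq > Since].
> > ```
> >
> > Removing the old sequence entry on update plays the same role as the
> > `RemoveSeqs` argument in the real by-seq insert.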
> >
> > A new document is inserted in both indexes in lines 705 and 706:
> >
> >    {ok, DocInfoByIdBTree2} = couch_btree:add_remove(DocInfoByIdBTree,
> >        IndexFullDocInfos, []),
> >    {ok, DocInfoBySeqBTree2} = couch_btree:add_remove(DocInfoBySeqBTree,
> >        IndexDocInfos, RemoveSeqs),
> >
> > At this point, our doc lives in the database structure and has been
> > assigned a new `rev`, but it has not yet been written to disk. The
> > last operation in `update_docs_int()` is `commit_data()`, which
> > sounds promising. Let’s jump down.
> >
> > The definition starts in line 781, the relevant bit for us in line 785.
> > CouchDB writes changes to disk in this fashion:
> >
> > 1. write all changes to the data and index trees to the disk.
> > 3. write a header to disk that has the current pointers to the index
> >    trees that we wrote in 1.
> >
> > Writing to disk does not yet mean that the data actually arrived on
> > disk. It might, but we only know for sure after we call the `fsync`
> > system call. From Erlang, we call `couch_file:sync()`.
> >
> > Now there are different classes of behaviour possible in the list above.
> > Notice how I left out 2.
> >
> > Writing a CouchDB file (which can be either a database file or a view
> > index) can give different storage guarantees. The options are to fsync
> > before the header is written, or after, or both. An fsync is a potentially
> > expensive operation, so we have fine-grained control over this here.
> >
> > The full list is:
> >
> > 1. write all changes to the data and index trees to the disk.
> > 2. fsync.
> > 3. write a header to disk that has the current pointers to the index
> >    trees that we wrote in 1.
> > 4. fsync.
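> >
> > The four steps above can be sketched with nothing but the standard `file`
> > module. This is a simplified illustration, not what `couch_file` does: the
> > module `commit_sketch` and the arguments `Path`, `Data` and `Header` are
> > hypothetical, and a real header would carry the pointers to the index
> > trees.
> >
> > ```erlang
> > -module(commit_sketch).
> > -export([commit/3]).
> >
> > %% Append Data, then Header, to the file at Path, with an fsync after
> > %% each write. Data and Header are binaries.
> > commit(Path, Data, Header) ->
> >     {ok, Fd} = file:open(Path, [append, raw, binary]),
> >     ok = file:write(Fd, Data),    % 1. write data and index changes
> >     ok = file:sync(Fd),           % 2. fsync: data is durable
> >     ok = file:write(Fd, Header),  % 3. write the header
> >     ok = file:sync(Fd),           % 4. fsync: header is durable
> >     file:close(Fd).
> > ```
> >
> > Dropping the first `file:sync/1` call trades some safety for speed: the
> > header could then reach the disk before the data it points to.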
> >
> > 2.-4. happen in `commit_data()`, but wait, where did 1. happen?
> >
> > For that, we need to jump back to `update_docs_int()`, line 697:
> >
> >    % Write out the document summaries (the bodies are stored in the
> >    % nodes of the trees, the attachments are already written to disk)
> >    {ok, FlushedFullDocInfos} = flush_trees(Db2, NewFullDocInfos, []),
> >
> > `flush_trees()` is defined in line 519. It iterates over the new data
> > in the database and recursively writes it to disk in line 547:
> >
> >    {ok, NewSummaryPointer, SummarySize} =
> >        couch_file:append_raw_chunk(Fd, Summary),
> >
> > Finally, we drop into `couch_file`, the lowest level of CouchDB.
> > `append_raw_chunk()` is defined in line 111 and it is just a small
> > wrapper that sends the `append_bin` message to the process that
> > manages the file descriptor for our database file.
> >
> > `append_bin` is handled in line 373. It takes the data to be
> > written and pads it out to make it a multiple of `?SIZE_BLOCK`
> > (which is 4096 bytes).
> >
> > In line 376 our data is finally written to disk:
> >
> >    file:write(Fd, Blocks)
> >
> > From here on out, we go back up into `couch_db_updater` and
> > deal with the header business we looked at earlier. From there,
> > control returns to `couch_db`, which waits for the write to
> > succeed. When that confirmation shows up, it hands it back to
> > `couch_httpd_db`, which uses `couch_httpd` to report the successful
> > write of the document as an HTTP response.
> >
> > This concludes our little tour.
> >
> > I hope this was helpful! Let us know if there are any questions.
> >
> > Jan
> > --
> >
> >
>
>


-- 
http://twitter.com/rmgraham
