nickva opened a new pull request, #5399:
URL: https://github.com/apache/couchdb/pull/5399

   Let clients issue concurrent pread calls without blocking each other or 
having to wait for all the writes and fsync calls.
   
   Even though pread calls are thread-safe at the POSIX level [1], the 
Erlang/OTP file backend forces a single controlling process for raw file 
handles. So all our reads were funnelled through the couch_file gen_server and 
had to queue up behind potentially slower writes. This is particularly 
problematic with remote file systems, where fsyncs and writes may take much 
longer while preads can hit the cache and return quickly.
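   As a standalone illustration of the POSIX guarantee (plain C, not CouchDB 
code): because pread takes an explicit offset, several threads can read the 
same file descriptor concurrently without sharing or moving a file position 
and without any locking:

   ```c
   /* Sketch: concurrent preads on one shared descriptor are safe. */
   #include <assert.h>
   #include <fcntl.h>
   #include <pthread.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   static int fd; /* one descriptor shared by all reader threads */

   static void *reader(void *arg) {
       long off = (long)arg;
       char buf[1];
       /* each thread reads its own byte; no lseek, no lock needed */
       ssize_t n = pread(fd, buf, 1, off);
       assert(n == 1);
       assert(buf[0] == 'a' + off);
       return NULL;
   }

   int main(void) {
       char path[] = "/tmp/pread_demo_XXXXXX";
       fd = mkstemp(path);
       assert(fd >= 0);
       assert(write(fd, "abcd", 4) == 4);

       pthread_t t[4];
       for (long i = 0; i < 4; i++)
           pthread_create(&t[i], NULL, reader, (void *)i);
       for (int i = 0; i < 4; i++)
           pthread_join(t[i], NULL);

       close(fd);
       unlink(path);
       puts("ok");
       return 0;
   }
   ```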
   
   Parallel pread calls are implemented via a NIF which copies the pread and 
file-closing bits from OTP's prim_file NIF [2]. Access to the shared handle is 
controlled via an RW lock, similar to how emmap does it [3]. Multiple readers 
can acquire the RW lock in read mode and issue pread calls in parallel on the 
same file descriptor. If a writer acquires it, all the readers have to wait for 
it. This kind of synchronization is necessary to carefully manage the closing 
state.
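   A minimal sketch of that synchronization scheme (hypothetical names, not 
the actual NIF code): readers take the RW lock in shared mode around pread(), 
while close takes it exclusively, so close can only run once every in-flight 
read has drained, and later readers see a closed flag instead of touching a 
dead descriptor:

   ```c
   /* Sketch: RW lock guarding a shared fd's read and close paths. */
   #include <assert.h>
   #include <errno.h>
   #include <fcntl.h>
   #include <pthread.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   struct shared_handle {
       pthread_rwlock_t lock;
       int fd;
       int closed;
   };

   static ssize_t handle_pread(struct shared_handle *h, void *buf,
                               size_t len, off_t off) {
       ssize_t n;
       pthread_rwlock_rdlock(&h->lock);     /* many readers at once */
       if (h->closed) {
           pthread_rwlock_unlock(&h->lock);
           errno = EBADF;
           return -1;
       }
       n = pread(h->fd, buf, len, off);
       pthread_rwlock_unlock(&h->lock);
       return n;
   }

   static void handle_close(struct shared_handle *h) {
       pthread_rwlock_wrlock(&h->lock);     /* waits for readers to drain */
       if (!h->closed) {
           close(h->fd);
           h->closed = 1;
       }
       pthread_rwlock_unlock(&h->lock);
   }

   int main(void) {
       char path[] = "/tmp/rwlock_demo_XXXXXX";
       struct shared_handle h;
       pthread_rwlock_init(&h.lock, NULL);
       h.closed = 0;
       h.fd = mkstemp(path);
       assert(h.fd >= 0 && write(h.fd, "xy", 2) == 2);

       char buf[2];
       assert(handle_pread(&h, buf, 2, 0) == 2);   /* reads succeed... */
       handle_close(&h);
       assert(handle_pread(&h, buf, 2, 0) == -1);  /* ...until closed */
       unlink(path);
       puts("ok");
       return 0;
   }
   ```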
   
   To keep things simple, the write path and the opening and handling of the 
main couch_file aren't affected. The parallel pread bypass is a purely 
opportunistic optimization when enabled; when disabled, reads proceed as they 
always did - through the gen_server.
   
   The cost of enabling it is at most one extra file descriptor, obtained via 
the dup() [4] system call from the main couch_file handle. Unlike a newly 
opened file, which would get its own file "description", the dup()-ed 
"descriptor" is just a reference to the exact same open file description in 
the kernel, sharing all the buffers, position, modes, etc., with the main 
couch_file. The reason we need a new dup()-ed file descriptor is to manage 
closing very carefully. Since on POSIX systems file descriptors are just 
integers, it's very easy to accidentally read from a descriptor that was 
already closed and re-opened (by something else). That's why there are locks 
and a whole new file descriptor which our NIF controls.
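   The descriptor-vs-description distinction can be seen in a small standalone 
demo (plain POSIX, not CouchDB code): the dup()-ed descriptor shares the open 
file description, so the file offset is common to both, yet it stays valid 
after the original descriptor is closed:

   ```c
   /* Sketch: dup() yields a new descriptor for the same file description. */
   #include <assert.h>
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   int main(void) {
       char path[] = "/tmp/dup_demo_XXXXXX";
       int fd = mkstemp(path);
       assert(fd >= 0 && write(fd, "abcd", 4) == 4);

       int fd2 = dup(fd);          /* new descriptor, same description */
       assert(fd2 >= 0);

       /* moving the offset through one descriptor is visible through the
        * other, because both refer to the same open file description */
       assert(lseek(fd, 1, SEEK_SET) == 1);
       assert(lseek(fd2, 0, SEEK_CUR) == 1);

       /* closing the original does not invalidate the dup()-ed reference */
       close(fd);
       char c;
       assert(pread(fd2, &c, 1, 3) == 1 && c == 'd');

       close(fd2);
       unlink(path);
       puts("ok");
       return 0;
   }
   ```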
   
   Another alternative was to use the exact same file descriptor as the main 
file and then, after every single pread, validate that the data was read from 
the same file by calling fstat and matching the device and inode numbers - 
while also hoping that a pread on some random pipe/socket/stdio handle would 
never block or cause any issue beyond quickly returning an error.
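   For reference, that rejected identity check could look roughly like this 
(hypothetical helper, not CouchDB code): after each pread, fstat the 
descriptor and trust the bytes only if the device/inode pair still matches the 
file that was originally opened:

   ```c
   /* Sketch: validating a pread via fstat(2) dev/inode comparison. */
   #include <assert.h>
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/stat.h>
   #include <unistd.h>

   /* returns 1 if fd still refers to the expected dev/inode pair */
   static int still_same_file(int fd, dev_t dev, ino_t ino) {
       struct stat st;
       if (fstat(fd, &st) != 0)
           return 0;
       return st.st_dev == dev && st.st_ino == ino;
   }

   int main(void) {
       char path[] = "/tmp/fstat_demo_XXXXXX";
       int fd = mkstemp(path);
       assert(fd >= 0 && write(fd, "abcd", 4) == 4);

       /* remember the identity of the file we opened */
       struct stat st;
       assert(fstat(fd, &st) == 0);

       char c;
       assert(pread(fd, &c, 1, 0) == 1);
       /* the read is trusted only if the identity check still passes;
        * a recycled descriptor number pointing elsewhere would fail it */
       assert(still_same_file(fd, st.st_dev, st.st_ino));

       close(fd);
       unlink(path);
       puts("ok");
       return 0;
   }
   ```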
   
   So far I've only checked that the cluster starts up and that reads and 
writes go through; a quick sequential benchmark indicates that plain 
sequential reads and writes haven't gotten worse, and they all seem to have 
improved a bit:
   
   ```
   > fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
    *** Parameters
    * batch_size       : 1000
    * doc_size         : small
    * docs             : 100000
    * individual_docs  : 1000
    * n                : 1
    * q                : 1
   
    *** Environment
    * Nodes        : 1
    * Bench ver.   : 1
    * N            : 1
    * Q            : 1
    * OS           : unix/linux
   ```
   
   Each case ran 5 times and the best rate in ops/sec was picked, so higher is 
better:
   
   ```
                                                   Default  CFile
   
   * Add 100000 docs, ok:100/accepted:0     (Hz):   16000    16000
   * Get random doc 100000X                 (Hz):    4900     5800
   * All docs                               (Hz):  120000   140000
   * All docs w/ include_docs               (Hz):   24000    31000
   * Changes                                (Hz):   49000    51000
   * Single doc updates 1000X               (Hz):     380      410
   ```
   
   [1] https://www.man7.org/linux/man-pages/man2/pread.2.html
   [2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
   [3] https://github.com/saleyn/emmap
   [4] https://www.man7.org/linux/man-pages/man2/dup.2.html
   

