nickva commented on a change in pull request #409: RFC for CouchDB background 
workers
URL: 
https://github.com/apache/couchdb-documentation/pull/409#discussion_r290060587
 
 

 ##########
 File path: rfcs/007-background-jobs.md
 ##########
 @@ -0,0 +1,350 @@
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: 'Background jobs with FoundationDB'
+labels: rfc, discussion
+assignees: ''
+
+---
+
+[NOTE]: # ( ^^ Provide a general summary of the RFC in the title above. ^^ )
+
+# Introduction
+
+This document describes a data model, implementation, and an API for running
+CouchDB background jobs with FoundationDB.
+
+## Abstract
+
+CouchDB background jobs are used for things like index building, replication
+and couch-peruser processing. We present a generalized model which allows
+creation, running, and monitoring of these jobs.
+
+The document starts with a description of the framework API in Erlang
+pseudo-code, then we show the data model, followed by the implementation
+details.
+
+## Requirements Language
+
+[NOTE]: # ( Do not alter the section below. Follow its instructions. )
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
+document are to be interpreted as described in
+[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+---
+
+`Job`: A unit of work, identified by a `JobId` and also having a `Type`.
+
+`Worker` : A language-specific execution unit that runs the job. Could be an
+Erlang process, a thread, or just a function.
+
+`Job table`: An FDB subspace holding the list of jobs.
+
+`Pending job`: A job that is waiting to run.
+
+`Pending queue` : A queue of pending jobs ordered by priority.
+
+`Running job`: A job which is currently executing. To be considered "running"
+the worker must periodically update the job's state in the global job table.
+
+`Priority`: A job's priority specifies its order in the pending queue. Priority
+can be any term that can be encoded as a key in FoundationDB's tuple layer. The
+exact value of `Priority` is job type specific. It MAY be a rough timestamp, a
+`Sequence`, a list of tags, etc.
+
+`Job re-submission` : Re-submitting a job means putting a previously running
+job back into the pending queue.
+
+`Activity monitor` : Functionality implemented by the framework which checks
+job liveness (activity). If workers don't update their status often enough,
+activity monitor will re-enqueue their jobs as pending. This ensures jobs make
+progress even if some workers terminate unexpectedly.
+
+`JobState`: Describes the current state of the job. The possible values are
+`"running"`, `"pending"`, and `"finished"`. These are the minimal number of
+states needed to describe a job's behavior in respect to this framework. Each
+job type MAY have additional, type specific states, such as `"failed"`,
+`"error"`, `"retrying"`, etc.
+
+`Sequence`: a 13 byte value formed by combining the current `Incarnation` of
+the database and the `Versionstamp` of the transaction. Sequences are
+monotonically increasing even when a database is relocated across FoundationDB
+clusters. See (RFC002) for a full explanation.
+
+---
+
+# Framework API
+
+This section describes the job creation and worker implementation APIs. It doesn't
+describe how the framework is implemented. The intended audience is CouchDB
+developers using this framework to implement background jobs for indexing,
+replication, and couch-peruser.
+
+Both the job creation and the worker implementation APIs use a `JobOpts` map to
+represent a job. It MAY also contain these top level fields:
+
+  * `"priority"` : The value of this field will contain the `Priority` value of
+    the job. `Priority` is job-type specific.
 
 Review comment:
   The priority is simply a sorting subspace for putting some jobs ahead of 
others. It allows workers to accept some jobs first, not necessarily in the 
order in which they were added.
   
   There are two extreme cases to consider:
   
   1) There is no priority at all (which is the same as saying all jobs have 
the same `null` or `normal` priority). In this case the `queue` becomes a jobs 
`bucket`. Any worker can pick any job regardless of when it was added to the 
pending queue. This setup would also minimize contention, as each worker could 
randomly pick one `JobId`.
   
   2) A strictly prioritized queue, with each job having its own priority 
according to, say, a versionstamp of when it was added. Jobs are guaranteed to 
be accepted in the order in which they were added, but there will be more 
contention when dequeuing them.
   
   The abstract priority as proposed allows having those extremes, and anything 
in between with a trade-off between ordering guarantees and dequeue contention.
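
As a rough illustration of the two extremes (a toy in-memory sketch, not the 
proposed FoundationDB implementation), the snippet below models the pending 
queue as a set of `(priority, job_id)` tuples. The names `candidates` and 
`accept` are hypothetical, and an integer stands in for a versionstamp. The 
number of jobs tied at the lowest priority is used as a loose proxy for how 
much room workers have to avoid contending on the same job.

```python
import random

def candidates(pending):
    """Return the keys that share the lowest-sorting priority.

    `pending` is a set of (priority, job_id) tuples. Any returned key may be
    accepted next, so the list length is a rough proxy for how many choices
    a worker has (more choices, less contention).
    """
    if not pending:
        return []
    lowest = min(priority for priority, _ in pending)
    return [key for key in pending if key[0] == lowest]

def accept(pending, rng):
    """Remove and return one randomly chosen key from the lowest priority."""
    pool = candidates(pending)
    if not pool:
        return None
    key = rng.choice(pool)
    pending.remove(key)
    return key

# Extreme 1: every job has the same priority, so the queue is a bucket.
bucket = {("normal", f"job-{i}") for i in range(1000)}

# Extreme 2: every job has a unique, increasing priority (an integer
# stands in for a versionstamp here), so the queue is strictly ordered.
strict = {(i, f"job-{i}") for i in range(1000)}

bucket_choices = len(candidates(bucket))          # 1000 jobs to choose from
strict_choices = len(candidates(strict))          # exactly 1 job to choose from
first_strict = accept(strict, random.Random(0))   # (0, "job-0")
```

In the bucket case a worker may take any of the 1000 jobs. In the strict case 
every worker competes for the same single job, which is where the extra 
dequeue contention would come from.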
   
   Some examples:
   
   Priority could be just two values "0-urgent" and "5-normal". There could be 
10000 urgent and 10000 normal jobs.
   ```
    (..., "0-urgent", "36b40ba869104c3d84a5f6821c04d26d") = ...
    (..., "0-urgent", "c63dfe3a2f8f459da4ce9e444c15fc48") = ...
     ...
    (..., "5-normal", "61b5e60438cc4865a838c675ab1aa076") = ...
    (..., "5-normal", "aa8c364ba05149e29db3ab358d1203c8") = ...
   ```
   
   Workers would first pick randomly among the 10000 urgent jobs and, only 
once all of those are accepted, start picking normal jobs.
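
A small sketch of that behavior, assuming the keys sort lexicographically on 
the priority string first (plain Python tuples are used here as a stand-in for 
tuple-encoded keys). The `drain` helper is hypothetical and the job count is 
reduced to three per priority:

```python
import random
import uuid

def drain(keys, rng):
    """Accept every job: lowest-sorting priority first, random within it."""
    remaining = sorted(keys)
    accepted = []
    while remaining:
        top = remaining[0][0]
        tied = [key for key in remaining if key[0] == top]
        pick = rng.choice(tied)
        remaining.remove(pick)
        accepted.append(pick)
    return accepted

# Three jobs per priority instead of 10000, with random hex JobIds.
keys = [
    (prio, uuid.uuid4().hex)
    for prio in ("5-normal", "0-urgent")
    for _ in range(3)
]

# Every "0-urgent" entry is accepted before any "5-normal" entry.
order = [prio for prio, _ in drain(keys, random.Random())]

# Without the numeric prefixes the labels would sort the other way round.
unprefixed = sorted(["urgent", "normal"])  # ["normal", "urgent"]
```

The `0-` and `5-` prefixes are what make the sort order match the intent, 
since `"normal"` on its own sorts before `"urgent"`.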
   
   Another case could be using timestamps with, say, a 5 second resolution vs 
a 5 minute resolution. With the 5 second resolution there would be more 
contention but a tighter timing guarantee. With the 5 minute resolution the 
timing guarantee would be looser, but contention would lessen, because each 
worker could now randomly choose to run any job from now till now + 5 minutes. 
There could be thousands more jobs to pick from.
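
To make the resolution trade-off concrete, here is a minimal sketch that 
rounds timestamps down to a window and counts how many jobs end up sharing 
each priority. The `time_priority` helper and the one-job-per-second workload 
are assumptions made only for illustration:

```python
from collections import Counter

def time_priority(timestamp_sec, resolution_sec):
    """Round a timestamp down to the start of its resolution window."""
    return timestamp_sec - timestamp_sec % resolution_sec

# One job added per second for ten minutes (hypothetical workload).
arrivals = range(600)

fine = Counter(time_priority(t, 5) for t in arrivals)      # 5 second windows
coarse = Counter(time_priority(t, 300) for t in arrivals)  # 5 minute windows

# (number of distinct priorities, jobs sharing each priority)
fine_summary = (len(fine), max(fine.values()))        # (120, 5)
coarse_summary = (len(coarse), max(coarse.values()))  # (2, 300)
```

With 5 second windows a worker chooses among at most 5 jobs at a time, so 
ordering stays close to arrival order. With 5 minute windows it chooses among 
300, which spreads workers out at the cost of a looser ordering guarantee.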
   
   > seems like a bit of an overload on what I would traditionally consider as 
a "priority"
   
   We could perhaps rename it to something like "sort order"? The replicator 
does need something to prioritize some jobs over others, and it might be 
useful for indexing where, say, if 1000 design documents are updated, we might 
choose to start indexing some of them before others.

