Generating views is extremely slow - makes CouchDB hard to use with a 
non-trivial number of docs
----------------------------------------------------------------------------------------------

                 Key: COUCHDB-620
                 URL: https://issues.apache.org/jira/browse/COUCHDB-620
             Project: CouchDB
          Issue Type: Improvement
          Components: Infrastructure
    Affects Versions: 0.10
         Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
            Reporter: Roger Binns


Generating views is extremely slow.  For example, adding 10 million documents 
takes less than 10 minutes, but generating some simple views on the same docs 
takes over 4 hours.

Using top you can see that CouchDB (erlang) and couchjs between them cannot 
even saturate a single CPU, let alone the I/O system.  Under ideal conditions 
performance should be limited by CPU, disk or memory.  That it is not implies 
that the processes are doing simple things in lockstep, accumulating latency 
inside each process as well as in the communication between them.  Multiplied 
by the number of documents, that latency can amount to a lot.
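
To put rough numbers on the figures above, here is a back-of-envelope 
calculation.  The variable names are only for illustration, and "less than 10 
minutes" and "over 4 hours" are taken as exactly 10 minutes and 4 hours, so the 
real gap is somewhat larger than what this prints.

```python
docs = 10_000_000
insert_seconds = 10 * 60        # "less than 10 minutes", taken as an upper bound
view_seconds = 4 * 60 * 60      # "over 4 hours", taken as a lower bound

# Average wall-clock cost per document, in microseconds.
insert_us_per_doc = insert_seconds / docs * 1e6
view_us_per_doc = view_seconds / docs * 1e6

# How many times slower view generation is than the initial load.
slowdown = view_seconds / insert_seconds

print(f"insert: {insert_us_per_doc:.0f} us/doc")
print(f"view:   {view_us_per_doc:.0f} us/doc")
print(f"ratio:  {slowdown:.0f}x")
```

That works out to about 60 microseconds per document for the load against 
about 1.4 milliseconds per document for the views, a factor of at least 24.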

Some suggestions:

* Run as many couchjs instances as there are processor cores and scatter work 
amongst them

* Have some sort of pipelining on the Erlang side so that the moment the first 
byte of a response is received from couchjs, the data for the next request is 
sent (the JSON conversion, HTTP headers etc. should all have been assembled 
already) to reduce latencies.  Do whatever is most similar in couchjs (e.g. use 
separate threads to read requests, process them and write responses).

* Use the equivalent of HTTP pipelining when talking to couchjs so that it 
always has a doc ready to work on rather than having to transmit an entire 
response and then wait for erlang to think and provide an entire new request
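
The effect of the suggestions above can be sketched with a toy cost model.  
This is not a measurement of CouchDB: the function names are made up, the 
per-document work and round-trip figures are hypothetical, and it assumes work 
divides evenly across workers with no other bottleneck.

```python
def lockstep_seconds(docs, work_s, round_trip_s):
    """One request at a time: every doc pays its own work plus a full round trip."""
    return docs * (work_s + round_trip_s)


def pipelined_seconds(docs, work_s, round_trip_s, workers=1):
    """Each worker always has a doc queued, so the round trip is paid only once
    up front and the work is divided evenly among the workers."""
    return round_trip_s + docs * work_s / workers


docs = 1_000_000
work_s = 0.0002        # hypothetical: 200 us of real work per doc
round_trip_s = 0.001   # hypothetical: 1 ms lost per request/response exchange

print(lockstep_seconds(docs, work_s, round_trip_s))
print(pipelined_seconds(docs, work_s, round_trip_s))
print(pipelined_seconds(docs, work_s, round_trip_s, workers=4))
```

With these made-up figures the lockstep case takes roughly 1200 seconds, 
pipelining alone brings it to roughly 200, and four workers to roughly 50.  The 
point is only that when the round trip dominates the per-document work, 
removing the wait matters more than adding cores.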

A simple test of success is to have a database with a million or so documents 
with a trivial view and have view creation max out the CPU, memory or disk.
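
As an illustration of what "a trivial view" could mean for such a test, here is 
a minimal design document built in Python.  The design document id "bench" and 
the view name "by_id" are placeholders, not anything from an existing setup.  
The document would be saved into the test database under _design/bench, and the 
first query of the view is what triggers the index build being timed.

```python
import json

# Hypothetical names: "bench" and "by_id" are placeholders for this sketch.
design_doc = {
    "_id": "_design/bench",
    "views": {
        "by_id": {
            # About the cheapest map function possible: one row per document.
            "map": "function(doc) { emit(doc._id, null); }"
        }
    },
}

# JSON body to store as the design document in the test database.
body = json.dumps(design_doc, indent=2)
print(body)
```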

Some things in CouchDB make this a particularly nasty problem.  View data is 
not replicated, so after replication the documents on the recipient database 
can be ahead of its view data by a large margin.  This can lead to 
inconsistencies.  You also can't expect users to wait minutes (or hours) for a 
request to complete because view generation got that far behind.  (My own plan 
now is to not use replication, and instead create the database file on another 
CouchDB instance and then rsync the binary database file over!)

Although stale=ok is available, you still have no idea whether the response 
will be quick or take however long view generation does.  (Sure, I could add 
some sort of timeout and complicate the code, but then what value do I pick?  
If I have a user waiting I want an answer ASAP or I have to give them some 
horrible error message.  Making them wait a long time and then giving them a 
timeout is even worse!)
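
For concreteness, the kind of client-side workaround being described might 
look like the sketch below.  The helper names view_url and query_view are 
invented for illustration, and the code uses only the Python standard library.  
It does not answer the question of what timeout to pick; it only shows where 
that guess ends up living in client code.

```python
import json
import urllib.parse
import urllib.request


def view_url(base, db, ddoc, view, stale_ok=True):
    """Build the URL for a view query, optionally adding stale=ok."""
    path = f"{base}/{db}/_design/{ddoc}/_view/{view}"
    params = {"stale": "ok"} if stale_ok else {}
    query = urllib.parse.urlencode(params)
    return f"{path}?{query}" if query else path


def query_view(url, timeout_s):
    """Return the parsed view response, or None if the request fails or no
    answer arrives within timeout_s seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.load(resp)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


# Placeholder database, design document and view names; 5984 is CouchDB's
# default port.
url = view_url("http://localhost:5984", "mydb", "bench", "by_id")
print(url)
```

A caller would then do something like query_view(url, timeout_s=2.0) and show 
an error when it returns None, which is exactly the user experience complained 
about above.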

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
