[ https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798612#action_12798612 ]
Roger Binns commented on COUCHDB-620:
-------------------------------------
There is no parallelizing at all. My comment does say that :-)
Although doing it in parallel (or by number of CPU cores) will improve things,
there is still lots more to be done. (Going from 4 hours to 2 hours is still
far too long.) The various latencies add up a lot. While CouchDB is
considering which document to get view info for next, the couchjs process is
sitting there idle. While couchjs is processing a doc, CouchDB sits there
idle. Each side doing work while the other side is also working will
significantly reduce the latencies and increase document processing throughput.
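
To illustrate the point about overlapping work, here is a minimal sketch in Python (not CouchDB or couchjs code). The names map_doc, lockstep, pipelined and the depth parameter are hypothetical. A reader thread stands in for the side that picks the next document, and the main loop stands in for the map step. The bounded queue keeps a few documents ready so the mapper does not wait for a fresh request each time.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the document stream


def map_doc(doc):
    # Hypothetical stand-in for the map function couchjs would run.
    return (doc["_id"], len(doc["body"]))


def lockstep(docs):
    # One side works while the other waits: fetch a doc, map it, repeat.
    return [map_doc(doc) for doc in docs]


def pipelined(docs, depth=8):
    # The reader thread keeps up to `depth` docs queued, so the mapper
    # always has work ready instead of waiting for the next request.
    pending = queue.Queue(maxsize=depth)

    def reader():
        for doc in docs:
            pending.put(doc)
        pending.put(_DONE)

    threading.Thread(target=reader, daemon=True).start()

    results = []
    while True:
        doc = pending.get()
        if doc is _DONE:
            break
        results.append(map_doc(doc))
    return results
```

Both functions return the same results in the same order; only the waiting pattern differs. In Python the threads mainly overlap I/O waits rather than CPU work, so this shows the shape of the idea, not the speedup a real implementation would get.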
> Generating views is extremely slow - makes CouchDB hard to use with
> non-trivial number of docs
> ----------------------------------------------------------------------------------------------
>
> Key: COUCHDB-620
> URL: https://issues.apache.org/jira/browse/COUCHDB-620
> Project: CouchDB
> Issue Type: Improvement
> Components: Infrastructure
> Affects Versions: 0.10
> Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
> Reporter: Roger Binns
>
> Generating views is extremely slow. For example, adding 10 million documents
> takes less than 10 minutes, but generating some simple views on the same docs
> takes over 4 hours.
> Using top you can see that CouchDB (Erlang) and couchjs between them cannot
> even saturate a single CPU, let alone the I/O system. Under ideal conditions
> performance should be limited by CPU, disk or memory. This implies that the
> processes are doing simple things in lockstep, accumulating latencies in each
> process as well as in the communication between them, which, when multiplied
> by the number of documents, can amount to a lot.
> Some suggestions:
> * Run as many couchjs instances as there are processor cores and scatter work
> amongst them
> * Have some sort of pipelining in the Erlang code so that the moment the first
> byte of response is received from couchjs the data is sent for the next
> request (the JSON conversion, HTTP headers etc. should all have been assembled
> already) to reduce latencies. Do whatever is most similar in couchjs (e.g. use
> separate threads to read requests, process them and write responses).
> * Use the equivalent of HTTP pipelining when talking to couchjs so that it
> always has a doc ready to work on rather than having to transmit an entire
> response and then wait for Erlang to think and provide an entire new request.
> A simple test of success is to have a database with a million or so documents
> with a trivial view and have view creation max out the CPU, memory or disk.
> Some things in CouchDB make this a particularly nasty problem. View data is
> not replicated, so replicated documents can get ahead of the view data by a
> large margin on the recipient database. This can lead to inconsistencies. You
> also can't expect users to then wait minutes (or hours) for a request to
> complete because the view generation got that far behind. (My own plans now
> are to not use replication and instead create the database file on another
> couchdb instance and then rsync the binary database file over instead!)
> Although stale=ok is available, you still have no idea if the response will
> be quick or take however long view generation does. (Sure I could add some
> sort of timeout and complicate the code but then what value do I pick? If I
> have a user waiting I want an answer ASAP or I have to give them some
> horrible error message. Taking a long wait and then giving a timeout is even
> worse!)
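
As a rough worked calculation, the figures in the description (10 million documents, under 10 minutes to insert, over 4 hours to build views) can be turned into per-document costs. The sketch below is in Python with hypothetical names. It treats the stated times as exact bounds, so the view figures are lower bounds on the cost and the insert figures are upper bounds.

```python
DOCS = 10_000_000
INSERT_SECONDS = 10 * 60      # "less than 10 minutes"
VIEW_SECONDS = 4 * 60 * 60    # "over 4 hours"


def per_doc_ms(total_seconds, docs=DOCS):
    # Average wall-clock time spent per document, in milliseconds.
    return total_seconds / docs * 1000.0


def docs_per_second(total_seconds, docs=DOCS):
    # Average throughput over the whole run.
    return docs / total_seconds


insert_ms = per_doc_ms(INSERT_SECONDS)
view_ms = per_doc_ms(VIEW_SECONDS)
slowdown = VIEW_SECONDS / INSERT_SECONDS
```

That works out to at least about 1.44 ms per document for view generation (under roughly 700 documents per second) against at most about 0.06 ms per document for insertion, a factor of at least 24.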
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.