Thanks Paul! Good sleuthing. We'll get it fixed,
Adam
On Jun 10, 2010, at 11:43 AM, Paul Bonser wrote:
> Ok, so I've tracked it down to the specific location where it happens:
>
> - couch_rep_reader:spawn_document_request/2 is called
> - in the SpawnFun defined in there, it calls couch_rep_reader:open_doc
> - open_doc gets an error, not_found response (not sure why, shouldn't the
> doc be there already?)
> - open_doc returns [] back to the SpawnFun
> - SpawnFun calls gen_server:call(Server, {add_docs, nil, Results}... with
> Results being []
> - handle_call(add_docs) calls handle_add_docs, which increments the document
> count..by 0..
> - and then returns {noreply,...}
> - then everything just sits there, because each part is waiting for another
> part to do something
>
> It seems the solution here is to either add a retry into
> spawn_document_request's SpawnFun, or at the very least, fail when open_doc
> returns [], rather than continuing on, since that results in a set of
> deadlocked processes.
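
A minimal sketch of the failure mode described above, written in Python rather than Erlang and not taken from the CouchDB source. The names Reader, fetch_lenient and fetch_strict are hypothetical, and the "finish once the expected count is reached" rule is an assumption made only for illustration. It shows how an empty result that is silently accepted leaves the waiting side with nothing to wake it up, and how failing on the empty result surfaces the problem instead.

```python
class Reader:
    """Toy stand-in for the reader process; 'expected' is a made-up completion rule."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0

    def add_docs(self, results):
        # Like the add_docs step described above: the count grows by
        # len(results), so an empty list is accepted and changes nothing.
        self.count += len(results)

    def done(self):
        return self.count >= self.expected


def open_doc(store, doc_id):
    # Returns [] on a miss, mirroring the not_found case.
    return [store[doc_id]] if doc_id in store else []


def fetch_lenient(reader, store, doc_id):
    # Passes whatever open_doc returned straight through, even [].
    reader.add_docs(open_doc(store, doc_id))


def fetch_strict(reader, store, doc_id):
    # Fails loudly on a miss instead of reporting zero documents.
    results = open_doc(store, doc_id)
    if not results:
        raise LookupError(f"document {doc_id!r} not found")
    reader.add_docs(results)


store = {"foo2": {"_id": "foo2"}}  # foo666 is deliberately absent
reader = Reader(expected=2)
fetch_lenient(reader, store, "foo2")
fetch_lenient(reader, store, "foo666")
print(reader.done())  # False: anything waiting on done() would wait forever
```

With fetch_strict in place of fetch_lenient, the second call raises LookupError, so the caller gets a chance to retry or abort rather than sit idle.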
>
> On Thu, Jun 10, 2010 at 9:28 AM, Paul Bonser <[email protected]> wrote:
>
>> Nope, just a regular 7200RPM SATA drive.
>>
>> So you guys may already know this, but I've tracked it down to a couch_rep
>> gen_server never terminating, and thus not calling do_terminate, and thus
>> the call to gen_server:call(Server, get_result, infinity) in
>> couch_rep:get_result just hangs forever.
>>
>>
>> On Thu, Jun 10, 2010 at 4:39 AM, Jan Lehnardt <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> thanks for the report. Out of curiosity, are you running an SSD drive in
>>> the box that reproduces the hangs?
>>>
>>> And anyone: Can you reproduce this on non-SSD machines?
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>> On 10 Jun 2010, at 02:26, Paul Bonser wrote:
>>>
>>>> Oh, I should also mention that I got the exact same error in multiple
>>>> freezes. Twice it was in the same exact order, and once it was in this
>>>> order:
>>>>
>>>> [info] [<0.95.0>] starting replication
>>>> "15c25eda4ea6308af6bea9864d5319ef" at <0.1845.0>
>>>> [debug] [<0.1207.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>> [info] [<0.1207.0>] 127.0.0.1 - - 'GET'
>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>>>> [debug] [<0.1207.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>>>> Headers: [{'Accept',"application/json"},
>>>> {'Accept-Encoding',"gzip"},
>>>> {'Content-Length',"167"},
>>>> {'Host',"localhost:5985"},
>>>> {'User-Agent',"CouchDB/0.12.0a953193"},
>>>> {"X-Couch-Full-Commit","false"}]
>>>> [debug] [<0.1207.0>] OAuth Params: []
>>>> [info] [<0.1207.0>] 127.0.0.1 - - 'POST'
>>>> /test_suite_rep_docs_db_b/_bulk_docs 201
>>>> [debug] [<0.1076.0>] 'GET'
>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>>>> Headers: [{'Accept',"application/json"},
>>>> {'Accept-Encoding',"gzip"},
>>>> {'Host',"localhost:5985"},
>>>> {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>> [debug] [<0.1076.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>> [debug] [<0.1076.0>] Minor error in HTTP request: {not_found,missing}
>>>> [debug] [<0.1076.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>>>> {couch_httpd_db,db_doc_req,3},
>>>> {couch_httpd_db,do_db_req,2},
>>>> {couch_httpd,handle_request_int,5},
>>>> {mochiweb_http,headers,5},
>>>> {proc_lib,init_p_do_apply,3}]
>>>> [info] [<0.1076.0>] 127.0.0.1 - - 'GET'
>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>>>> [debug] [<0.1076.0>] httpd 404 error response:
>>>> {"error":"not_found","reason":"missing"}
>>>>
>>>>
>>>> Could it be some sort of race condition?
>>>>
>>>>
>>>>
>>>> On Wed, Jun 9, 2010 at 8:22 PM, Paul Bonser <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 9, 2010 at 7:33 PM, J Chris Anderson <[email protected]> wrote:
>>>>>
>>>>>> Devs,
>>>>>>
>>>>>> Is anyone else seeing the replicator test hang and never finish?
>>>>>>
>>>>>> It never hangs the first few runs, but after running ten or so times, I'll
>>>>>> end up with the test suite waiting for a replication that never finishes.
>>>>>> It's the same story on 0.11.0, 0.11.x, and trunk.
>>>>>>
>>>>>> Is anyone else able to reproduce this? Am I crazy?
>>>>>>
>>>>>
>>>>> It just froze for me on the first try, using 0.12.0a935298, then ran
>>>>> successfully 3 times, then froze again. The last thing logged the first time
>>>>> was a _bulk_docs request; the last thing logged this time was a PUT to
>>>>> /test_suite_db_b/_local%2F6598a76aa55cd39645e4730b4cb28c00
>>>>>
>>>>> I'm running a Firefox 3.6 nightly build on Linux. For me, it froze the
>>>>> first time when I did a "run all" and the second time when just directly
>>>>> running the replication test.
>>>>>
>>>>> After svn up-ing to the latest in trunk, it froze on the first try,
>>>>> directly running the replication test.
>>>>>
>>>>> Here's the debug output for the last _replicate request where it freezes.
>>>>> It's requesting a document that isn't there.
>>>>>
>>>>>
>>>>> [info] [<0.95.0>] starting new replication
>>>>> "15c25eda4ea6308af6bea9864d5319ef" at <0.848.0>
>>>>> [debug] [<0.191.0>] 'GET'
>>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>> {'Accept-Encoding',"gzip"},
>>>>> {'Host',"localhost:5985"},
>>>>> {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>>> [debug] [<0.191.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>>> [info] [<0.191.0>] 127.0.0.1 - - 'GET'
>>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>>>>> [debug] [<0.189.0>] 'GET'
>>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>> {'Accept-Encoding',"gzip"},
>>>>> {'Host',"localhost:5985"},
>>>>> {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>>> [debug] [<0.189.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>>> [debug] [<0.189.0>] Minor error in HTTP request: {not_found,missing}
>>>>> [debug] [<0.189.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>>>>> {couch_httpd_db,db_doc_req,3},
>>>>> {couch_httpd_db,do_db_req,2},
>>>>> {couch_httpd,handle_request_int,5},
>>>>> {mochiweb_http,headers,5},
>>>>> {proc_lib,init_p_do_apply,3}]
>>>>> [info] [<0.189.0>] 127.0.0.1 - - 'GET'
>>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>>>>> [debug] [<0.189.0>] httpd 404 error response:
>>>>> {"error":"not_found","reason":"missing"}
>>>>>
>>>>> [debug] [<0.191.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>> {'Accept-Encoding',"gzip"},
>>>>> {'Content-Length',"167"},
>>>>> {'Host',"localhost:5985"},
>>>>> {'User-Agent',"CouchDB/0.12.0a953193"},
>>>>> {"X-Couch-Full-Commit","false"}]
>>>>> [debug] [<0.191.0>] OAuth Params: []
>>>>> [info] [<0.191.0>] 127.0.0.1 - - 'POST'
>>>>> /test_suite_rep_docs_db_b/_bulk_docs 201
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Paul Bonser
>>>>> http://probablyprogramming.com
>>>>>