[ https://issues.apache.org/jira/browse/COUCHDB-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ciprian Trusca updated COUCHDB-2496:
------------------------------------
    Description: 
We have the following setup:

* two CouchDB machines with replication enabled between them 
* a watchdog running every 5 minutes, which verifies the status of the 
_replicator documents. If one of those documents has _replication_state = 
error, the watchdog deletes it and creates a new one with the exact same 
parameters.
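
The watchdog step described above could be sketched roughly as follows. This is not the reporter's actual script: the function name `plan_recreation` and the choice to strip every `_replication_*` field plus `_rev` are assumptions made for illustration, and the HTTP calls that would delete and create the documents are left out.

```python
import copy


def plan_recreation(docs):
    """Return (to_delete, to_create) for documents whose state is 'error'.

    `docs` is a list of _replicator documents as plain dicts.
    """
    to_delete, to_create = [], []
    for doc in docs:
        if doc.get("_replication_state") != "error":
            continue
        # Deleting a document requires its id and current revision.
        to_delete.append((doc["_id"], doc["_rev"]))
        # Assumption: the replacement keeps the user-supplied parameters
        # and drops _rev and the server-managed _replication_* fields.
        fresh = {
            key: copy.deepcopy(value)
            for key, value in doc.items()
            if key != "_rev" and not key.startswith("_replication_")
        }
        to_create.append(fresh)
    return to_delete, to_create
```

Every delete and create appends to the database file, so a cycle like this makes the file grow until compaction runs.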

For this test, one CouchDB machine is shut down so the watchdog will 
continuously recreate the _replicator documents, and that will cause the 
_replicator database to get fragmented. 
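
As a rough illustration of what "fragmented" means here, the sketch below computes fragmentation as the share of the database file not occupied by live data. It assumes the `disk_size` and `data_size` values come from the database information document, and the default threshold of 70 simply mirrors the figure in this report. The function names are hypothetical.

```python
def fragmentation_percent(disk_size, data_size):
    """Share of the database file not occupied by live data, in percent."""
    if disk_size <= 0:
        return 0.0
    return (disk_size - data_size) / disk_size * 100.0


def needs_compaction(disk_size, data_size, threshold=70.0):
    """True when fragmentation exceeds the threshold (70 is assumed here)."""
    return fragmentation_percent(disk_size, data_size) > threshold
```

For example, a 1000-byte file holding 250 bytes of live data is 75% fragmented and would exceed a 70% threshold.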

Several times couch.log states that this database is fragmented over the 70% 
threshold, but there is no evidence that compaction of the 
_replicator database is ever started. Instead, after approximately 5 seconds we get 
the following error:
{code}
** Reason for termination ==
 ** {compaction_loop_died, 
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
{code}
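
The error term shows a `gen_server:call` with only a pid and a request, which in Erlang/OTP means the default 5000 ms timeout applies. That is consistent with the roughly 5 seconds observed. The Python sketch below is only an analogy for that pattern: a caller sends a request and gives up if no reply arrives in time. The names `call` and `serve_one` are hypothetical and are not CouchDB code.

```python
import queue
import threading


def call(inbox, request, timeout=5.0):
    """Send a request and wait for the reply, giving up after `timeout` s."""
    reply_box = queue.Queue(maxsize=1)
    inbox.put((request, reply_box))
    try:
        return reply_box.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(
            f"no reply to {request!r} within {timeout}s"
        ) from None


def serve_one(inbox):
    """Answer a single request; stands in for a responsive server."""
    request, reply_box = inbox.get()
    reply_box.put(("ok", request))
```

If nothing reads from `inbox`, as when the server is busy with other work, `call` raises `TimeoutError`, much like the caller in the log terminating with a `timeout` reason.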

The worst part is that, from time to time, the error appears several times in a 
short interval (e.g. 3 times in 60 seconds), and this causes the whole 
CouchDB server to crash with:
{code}
[error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
{code}
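
The `reached_max_restart_intensity` reason comes from the OTP supervisor rule that a supervisor gives up once more than a configured number of restarts occur within a configured period. The sketch below models that rule with a sliding window. The class name and the limits used in the example are illustrative assumptions, not the values configured for `couch_secondary_services`.

```python
from collections import deque


class RestartTracker:
    """Sliding-window restart counter in the style of an OTP supervisor."""

    def __init__(self, max_restarts, period_seconds):
        self.max_restarts = max_restarts
        self.period = period_seconds
        self.restarts = deque()

    def record_restart(self, now):
        """Record a restart at time `now` (seconds).

        Returns False when more than `max_restarts` restarts fall inside
        the last `period_seconds`, i.e. when the supervisor would give up.
        """
        self.restarts.append(now)
        # Forget restarts that have fallen out of the window.
        while self.restarts and now - self.restarts[0] > self.period:
            self.restarts.popleft()
        return len(self.restarts) <= self.max_restarts
```

With a hypothetical limit of 2 restarts per 60 seconds, crashes at 0 s, 20 s and 40 s exceed the limit on the third restart, while crashes spaced 50 s apart never do.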
 
All subsequent requests to CouchDB are then refused for a period of time 
(we measured between 3 and 50 minutes). 

Because this is a heavy load test, we isolated CouchDB on a ramdisk to 
make sure that this is not a disk usage problem, but the error persists.

Please let me know if additional information is required. 
Thank you.




  was:
We have the following setup:

* two CouchDB machines with replication enabled between them 
* a watchdog running every 5 minutes, which verifies the status of the 
_replicator documents. If one of those documents has _replication_state = 
error, the watchdog deletes it and creates a new one with the exact same 
parameters.

For this test, one CouchDB machine is shut down so the watchdog will 
continuously recreate the _replicator documents, and that will cause the 
_replicator database to get fragmented. 

Several times the couch.log state that this database is fragmented over the 70% 
threshold, but then there isn't any evidence that the compaction for the 
_replicator database is started. Instead, after approximately we get the 
following error
{code}
** Reason for termination ==
 ** {compaction_loop_died, 
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
{code}

The worse part is that, from time to time, the error appears several times in a 
short interval of time (eg. 3 times / 60 seconds) and this causes the whole 
CouchDB server to crash with:
{code}
[error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
{code}
 
All the subsequent requests to CouchDb are then refused for a period of time ( 
we measured between 3 and 50 minutes). 

Because this is a heavy load test we isolated CouchDb in a ramdisk in order to 
make sure that this is not a disk usage problem, but the error persists

Please let me know if additional information is required. 
Thank you.





> compaction repeated timeouts causes the server to shutdown temporary when 
> replication is broken
> -----------------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-2496
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2496
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Database Core
>    Affects Versions: 1.6.1
>         Environment: Centos 6.4
>            Reporter: Ciprian Trusca
>
> We have the following setup:
> * two CouchDB machines with replication enabled between them 
> * a watchdog running every 5 minutes, which verifies the status of the 
> _replicator documents. If one of those documents has _replication_state = 
> error, the watchdog deletes it and creates a new one with the exact same 
> parameters.
> For this test, one CouchDB machine is shut down so the watchdog will 
> continuously recreate the _replicator documents, and that will cause the 
> _replicator database to get fragmented. 
> Several times couch.log states that this database is fragmented over the 
> 70% threshold, but there is no evidence that compaction of the 
> _replicator database is ever started. Instead, after approximately 5 seconds we 
> get the following error:
> {code}
> ** Reason for termination ==
>  ** {compaction_loop_died, 
>        {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
> {code}
> The worst part is that, from time to time, the error appears several times in 
> a short interval (e.g. 3 times in 60 seconds), and this causes the whole 
> CouchDB server to crash with:
> {code}
> [error] [<0.93.0>] {error_report,<0.30.0>,
>                        {<0.93.0>,supervisor_report,
>                         [{supervisor,{local,couch_secondary_services}},
>                          {errorContext,shutdown},
>                          {reason,reached_max_restart_intensity},
>                          {offender,
>                              [{pid,<0.10114.14>},
>                               {name,compaction_daemon},
>                               {mfargs,{couch_compaction_daemon,start_link,[]}},
>                               {restart_type,permanent},
>                               {shutdown,brutal_kill},
>                               {child_type,worker}]}]}}
> {code}
>  
> All subsequent requests to CouchDB are then refused for a period of time 
> (we measured between 3 and 50 minutes). 
> Because this is a heavy load test, we isolated CouchDB on a ramdisk to 
> make sure that this is not a disk usage problem, but the error persists.
> Please let me know if additional information is required. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
