[ https://issues.apache.org/jira/browse/COUCHDB-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ciprian Trusca updated COUCHDB-2496:
------------------------------------
Description:
We have the following setup:
* two CouchDB machines with replication enabled between them
* a watchdog running every 5 minutes, which verifies the status of the
_replicator documents. If one of those documents has _replication_state =
error, the watchdog deletes it and recreates it with exactly the same
parameters (a rough sketch of this watchdog follows the list).
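For reference, the watchdog logic is roughly the following. This is a simplified Python sketch under assumptions about our setup (the host URL and the continuous flag are placeholders), not the exact script we run:
{code}
import requests

COUCH = "http://127.0.0.1:5984"  # local node (placeholder URL)

def check_replicator_docs():
    # List every document in the _replicator database together with its body.
    rows = requests.get(COUCH + "/_replicator/_all_docs",
                        params={"include_docs": "true"}).json()["rows"]
    for row in rows:
        doc = row["doc"]
        if doc["_id"].startswith("_design/"):
            continue  # skip design documents
        if doc.get("_replication_state") != "error":
            continue  # only failed replications are recycled
        # Delete the failed replication document...
        requests.delete(COUCH + "/_replicator/" + doc["_id"],
                        params={"rev": doc["_rev"]})
        # ...and recreate it with exactly the same parameters.
        requests.put(COUCH + "/_replicator/" + doc["_id"],
                     json={"source": doc["source"],
                           "target": doc["target"],
                           "continuous": doc.get("continuous", True)})

if __name__ == "__main__":
    check_replicator_docs()  # invoked from cron every 5 minutes
{code}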
For this test, one CouchDB machine is shut down, so the watchdog continuously
recreates the _replicator documents, which causes the _replicator database to
become fragmented.
Several times couch.log states that this database is fragmented beyond the 70%
threshold, but there is no evidence that compaction of the _replicator database
ever starts. Instead, after some time we get the following error:
{code}
** Reason for termination ==
** {compaction_loop_died,
{timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
{code}
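For context, the 70% figure above is the compaction daemon threshold from our
local.ini. Assuming a configuration along the lines of the stock 1.6 example
(the exact values below are illustrative), it looks like this:
{code}
[compaction_daemon]
; check fragmentation levels every 5 minutes
check_interval = 300
; ignore files smaller than 128 KiB
min_file_size = 131072

[compactions]
; compact databases/views that are more than 70%/60% fragmented
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]
{code}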
The worst part is that, from time to time, the error appears several times
within a short interval (e.g. 3 times in 60 seconds), and this causes the whole
CouchDB server to crash with:
{code}
[error] [<0.93.0>] {error_report,<0.30.0>,
{<0.93.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,
[{pid,<0.10114.14>},
{name,compaction_daemon},
{mfargs,{couch_compaction_daemon,start_link,[]}},
{restart_type,permanent},
{shutdown,brutal_kill},
{child_type,worker}]}]}}
{code}
All subsequent requests to CouchDB are then refused for a period of time (we
measured between 3 and 50 minutes).
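(The 3-50 minute window was measured with a simple availability probe along
the lines of the sketch below; it is illustrative, not our exact tooling.)
{code}
import time
import requests

def measure_outage(url="http://127.0.0.1:5984/"):
    # Poll the root endpoint once per second and return how long CouchDB
    # kept refusing or failing requests (in seconds).
    down_since = None
    while True:
        try:
            requests.get(url, timeout=5).raise_for_status()
            if down_since is not None:
                return time.time() - down_since
        except requests.RequestException:
            if down_since is None:
                down_since = time.time()
        time.sleep(1)

if __name__ == "__main__":
    print("CouchDB was unreachable for %.0f seconds" % measure_outage())
{code}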
Because this is a heavy load test, we isolated CouchDB on a ramdisk to make
sure that this is not a disk usage problem, but the error persists.
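(For completeness, the ramdisk isolation simply points the data directories at
a tmpfs mount in local.ini; the mount point below is an example:)
{code}
[couchdb]
; /mnt/couchdb_ram is a tmpfs mount created before CouchDB is started
database_dir = /mnt/couchdb_ram
view_index_dir = /mnt/couchdb_ram
{code}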
Please let me know if additional information is required.
Thank you.
Environment: Centos 6.4
Affects Version/s: 1.6.1
> Repeated compaction timeouts cause the server to shut down temporarily when
> replication is broken
> -----------------------------------------------------------------------------------------------
>
> Key: COUCHDB-2496
> URL: https://issues.apache.org/jira/browse/COUCHDB-2496
> Project: CouchDB
> Issue Type: Bug
> Security Level: public(Regular issues)
> Components: Database Core
> Affects Versions: 1.6.1
> Environment: Centos 6.4
> Reporter: Ciprian Trusca
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)