https://bugzilla.redhat.com/show_bug.cgi?id=1059913

            Bug ID: 1059913
           Summary: Race condition creating .erlang.cookie
           Product: Fedora
           Version: 20
         Component: rabbitmq-server
          Assignee: hubert.plocinic...@gmail.com
          Reporter: jecke...@redhat.com
        QA Contact: extras...@fedoraproject.org
                CC: erlang@lists.fedoraproject.org,
                    hubert.plocinic...@gmail.com, lemen...@gmail.com,
                    skott...@redhat.com



Description of problem:

There is a race condition when starting rabbitmq-server for the first time.

When the erlang runtime starts, it tries to read its cookie file (for rabbitmq,
/var/lib/rabbitmq/.erlang.cookie) and if it doesn't already exist, it generates
a new random cookie and creates the file.
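
The check-then-create logic described above can be modelled roughly as
follows. This is a simplified Python sketch, not the erlang runtime's actual
code: the function name, the cookie format and the 0400 mode are assumptions
made for illustration.

```python
import os
import secrets


def naive_load_or_create_cookie(path):
    """Simplified model of a check-then-create cookie startup routine."""
    # Step 1: check whether the cookie file already exists.
    if os.path.exists(path):
        with open(path) as f:
            return f.read()

    # Step 2: generate a new random cookie (format is an assumption).
    cookie = secrets.token_hex(10).upper()

    # Race window: a second process running the same code can create the
    # file and make it read-only between step 1 and step 3.

    # Step 3: create the file and make it read-only. If another process got
    # here first, opening for writing fails with PermissionError (EACCES)
    # for a non-root user.
    with open(path, "w") as f:
        f.write(cookie)
    os.chmod(path, 0o400)
    return cookie
```

Run by a single process the routine behaves as intended. The problem only
appears when two processes execute it at the same time against the same path.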

The following two lines from the rabbitmq-server.service unit file are
involved:

ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid

The rabbitmq-server command returns before the service is up.  Therefore the
additional rabbitmqctl wait is required to make sure the service has started
completely.  However, both of these are erlang programs, and they share the
cookie startup code described previously.

The order of events and the eventual error vary, but generally what happens
is:

- ExecStart (rabbitmq-server) is run and exits.  The erlang runtime is now
booting in the background.

- ExecStartPost (rabbitmqctl) is run.

- rabbitmq-server determines the cookie file is not present, and generates a
new cookie.

- rabbitmqctl determines the cookie file is not present, and generates a new
cookie.

- rabbitmq-server writes the new cookie to disk and sets the file to read-only.

- rabbitmqctl tries to open the cookie file read/write in order to write its
cookie, but fails with EACCES because the file already exists and is
read-only.

- The erlang runtime for rabbitmqctl crashes and the command returns with a
non-successful exit code.

- The entire service unit is marked as failed, and all of the processes are
killed by systemd.
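
For illustration, one common way to close this kind of check-then-create
window is to publish the file atomically and let the loser of the race adopt
the existing cookie. The Python sketch below is a hedged example only: it is
not the erlang runtime's code and not the rabbitmq workaround mentioned under
"Additional info", and the function name and cookie format are hypothetical.

```python
import os
import secrets
import tempfile


def load_or_create_cookie(path):
    """Return the cookie stored at `path`, creating the file if it is absent."""
    directory = os.path.dirname(path) or "."
    # Write the candidate cookie to a private temporary file first, so the
    # final path never exists in a half-written state.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        candidate = secrets.token_hex(10).upper()  # format is an assumption
        with os.fdopen(fd, "w") as tmp:
            tmp.write(candidate)
        os.chmod(tmp_path, 0o400)  # read-only for the owner
        try:
            # link() fails if `path` already exists, so only one racer wins.
            os.link(tmp_path, path)
            return candidate
        except FileExistsError:
            # Lost the race: adopt the winner's cookie instead of overwriting.
            with open(path) as existing:
                return existing.read()
    finally:
        # The temporary name is no longer needed in either case.
        os.unlink(tmp_path)
```

With this shape, two processes that start at the same time end up with the
same cookie, and neither tries to open an existing read-only file for writing.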


Version-Release number of selected component (if applicable):

rabbitmq-server-3.1.5-1.fc20.noarch
erlang-R16B-03.1.fc20.x86_64

How reproducible:

There is some variability since it's a race.  I've provided my reproducer below
that works 100% of the time for me, inside an F20 VM.  In theory this behavior
should still exist when starting the service from the cli instead of rebooting,
but I can't reproduce it that way.

Steps to Reproduce:
1. install rabbitmq-server
2. systemctl enable rabbitmq-server.service
3. reboot

Actual results:
Service fails to start; see the attached journalctl output for the error.

Expected results:
Service starts cleanly

Additional info:

This is really an erlang bug, but the workaround for rabbit is simple (I'll
post a patch in a followup).  I'll run down the erlang bit separately but it
will take longer, so it makes sense to apply the workaround here until erlang
is fixed.
