Hi all,

I've been tracking down an issue we've been seeing on our Bareos install, 
where restore jobs submitted at the same time can end up with the same job 
name, which results in one of the jobs being rejected by the storage daemon 
and failing.

I think I've found the cause in the source code, inside the 
CreateUniqueJobName function:

https://github.com/bareos/bareos/blob/1417cff723cca48eccba156eca8fd90b99fbe122/core/src/dird/job.cc#L1503

The seq variable is incremented while the mutex is held, which is safe, but 
its value is then read (when the job name is formatted into the 
JobControlRecord) after the mutex has been released. That's a race: another 
thread can increment seq in between, so two concurrent jobs can read the 
same value and end up with identical names.
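
To make the failure mode concrete, here is a minimal, self-contained C++ 
sketch of the pattern as I understand it (simplified names and structure, 
not the actual Bareos code): the counter is bumped while the lock is held, 
but re-read after the unlock, so two threads can format the same number 
into their job names.

#include <cstdio>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

static std::mutex seq_mutex;
static int seq = 0;  // shared, like the static seq in CreateUniqueJobName

static std::string MakeJobNameRacy(const char* base) {
  seq_mutex.lock();
  seq++;  // the increment itself is protected...
  seq_mutex.unlock();
  char buf[128];
  // ...but seq is re-read here, outside the lock: another thread may have
  // incremented it in between, so both threads format the same value.
  snprintf(buf, sizeof(buf), "%s_%02d", base, seq);
  return buf;
}

int main() {
  std::set<std::string> names;
  std::mutex names_mutex;
  std::vector<std::thread> threads;
  for (int i = 0; i < 500; i++) {
    threads.emplace_back([&] {
      std::string name = MakeJobNameRacy("archive.2019-12-25_18.34.31");
      std::lock_guard<std::mutex> g(names_mutex);
      if (!names.insert(name).second)
        printf("duplicate job name: %s\n", name.c_str());
    });
  }
  for (auto& t : threads) t.join();
  return 0;
}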

I've written a short program to verify this (for my own understanding as 
much as anyone else's), in which I've also attempted to fix the issue by 
copying seq into a non-static local (i.e. per-thread) variable while the 
mutex is still held, and then using that copy for the printf statement 
outside of the mutex. It seems to work.

https://gist.github.com/WaryWolf/ea7d524f96725d823aae5d96a3727442
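
For reference, the fix in my test program boils down to taking a private 
copy of seq while the lock is still held and formatting from that copy. 
Roughly this (reusing seq and seq_mutex from the sketch above):

static std::string MakeJobNameFixed(const char* base) {
  int my_seq;
  {
    std::lock_guard<std::mutex> g(seq_mutex);
    my_seq = ++seq;  // take a private copy while the lock is held
  }
  char buf[128];
  // my_seq lives on this thread's stack, so later increments of the
  // shared seq by other threads cannot change what gets formatted here.
  snprintf(buf, sizeof(buf), "%s_%02d", base, my_seq);
  return buf;
}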

I'm happy to submit a PR for this; I just wanted to confirm via the mailing 
list first, as I'm relatively new to this project.

I've attached a sample from our bareos.log showing this bug in action. I've 
stripped out some unnecessary lines, but hopefully the log still makes 
sense: the AfterJob scripts for jobs 198059 and 198072 create jobs 198166 
and 198168 respectively, both of which get the name 
"archive.2019-12-25_18.34.31_17". 198166 starts successfully, but when 
198168 starts, the storage daemon rejects it because a job with the same 
name has already been authenticated.

Thanks and regards,

Anthony Vaccaro

25-Dec 18:34 backuphost JobId 198072: shell command: run AfterJob 
"/usr/lib/bareos/scripts/barchive --client=host1 --jobid=198072 --pool=daily 
--fileset=host1:radius_logs --restorejob=archive"
25-Dec 18:34 backuphost JobId 198059: shell command: run AfterJob 
"/usr/lib/bareos/scripts/barchive --client=host2 --jobid=198059 --pool=daily 
--fileset=host2:security_logs --restorejob=archive"
25-Dec 19:04 backuphost JobId 198166: Start Restore Job 
archive.2019-12-25_18.34.31_17
25-Dec 19:04 backuphost JobId 198166: Using Device "File1Storage38" to read.
25-Dec 19:04 backuphost-sd JobId 198166: Ready to read from volume 
"daily-000704.vol" on device "File1Storage38" (/backup).
25-Dec 19:04 backuphost-sd JobId 198166: autoxflate-sd: Compressor on device 
File1Storage01 is LZ4HC
25-Dec 19:04 backuphost-sd JobId 198166: autoxflate-sd: File1Storage01 
OUT:[SD->inflate=yes->deflate=yes->DEV] IN:[DEV->inflate=yes->deflate=no->SD]
25-Dec 19:04 backuphost-sd JobId 198166: Forward spacing Volume 
"daily-000704.vol" to file:block 0:1149478163.
25-Dec 19:04 backuphost-sd JobId 198166: End of Volume at file 0 on device 
"File1Storage38" (/backup), Volume "daily-000704.vol"
25-Dec 19:04 backuphost-sd JobId 198166: Warning: vol_mgr.c:557 Need volume 
from other drive, but swap not possible. Status: read=1 num_writers=0 
num_reserve=0 swap=0 vol=daily-000620.vol from dev="File1Storage39" (/backup) 
to "File1Storage38" (/backup)
25-Dec 19:04 backuphost-sd JobId 198166: Warning: acquire.c:294 Read acquire: 
label.c:254 Could not reserve volume daily-000620.vol on "File1Storage38" 
(/backup)
25-Dec 19:04 backuphost-sd JobId 198166: Please mount read Volume 
"daily-000620.vol" for:
    Job:          archive.2019-12-25_18.34.31_17
    Storage:      "File1Storage38" (/backup)
    Pool:         default
    Media type:   File1

25-Dec 19:08 backuphost JobId 198168: Start Restore Job 
archive.2019-12-25_18.34.31_17
25-Dec 19:08 backuphost JobId 198168: Using Device "File1Storage01" to read.
25-Dec 19:08 backuphost-sd JobId 198166: Fatal error: fd_cmds.c:124 Hey!!!! 
JobId 198166 Job archive.2019-12-25_18.34.31_17 already authenticated.
25-Dec 19:08 backuphost JobId 198168: Fatal error: Authorization key rejected 
by Storage daemon .
25-Dec 19:08 backuphost JobId 198168: Fatal error: Authorization key rejected 
by Storage daemon .
25-Dec 19:08 backuphost JobId 198168: Fatal error: Failed to authenticate 
Storage daemon.
25-Dec 19:08 backuphost JobId 198168: Fatal error: Failed to authenticate 
Storage daemon.
25-Dec 19:08 backuphost JobId 198168: Fatal error: Bad response to Storage 
command: wanted 2000 OK storage
, got 2902 Bad storage

25-Dec 19:08 backuphost-sd JobId 198166: acquire.c:221 Job 198166 canceled.
25-Dec 19:08 backuphost JobId 198168: Error: Bareos backuphost 16.2.7 (09Oct17):
  Build OS:               x86_64-redhat-linux-gnu redhat CentOS Linux release 
7.4.1708 (Core) 
  JobId:                  198168
  Job:                    archive.2019-12-25_18.34.31_17
  Restore Client:         backuphost
  Start time:             25-Dec-2019 19:08:40
  End time:               25-Dec-2019 19:08:45
  Elapsed time:           5 secs
  Files Expected:         32
  Files Restored:         0
  Bytes Restored:         0
  Rate:                   0.0 KB/s
  FD Errors:              1
  FD termination status:  
  SD termination status:  Waiting on FD
  Termination:            *** Restore Error ***

25-Dec 19:08 backuphost JobId 198168: Error: Bareos backuphost 16.2.7 (09Oct17):
  Build OS:               x86_64-redhat-linux-gnu redhat CentOS Linux release 
7.4.1708 (Core) 
  JobId:                  198168
  Job:                    archive.2019-12-25_18.34.31_17
  Restore Client:         backuphost
  Start time:             25-Dec-2019 19:08:40
  End time:               25-Dec-2019 19:08:45
  Elapsed time:           5 secs
  Files Expected:         32
  Files Restored:         0
  Bytes Restored:         0
  Rate:                   0.0 KB/s
  FD Errors:              2
  FD termination status:  
  SD termination status:  Waiting on FD
  Termination:            *** Restore Error ***

25-Dec 19:08 backuphost-sd JobId 198166: Fatal error: mount.c:943 Cannot open 
Dev="File1Storage38" (/backup), Vol=daily-000620.vol
25-Dec 19:08 backuphost-sd JobId 198166: End of all volumes.
25-Dec 19:08 backuphost-sd JobId 198166: Error: bsock_tcp.c:442 Socket is 
terminated=1 on call to client:10.100.1.25:9103
