Addendum: below is the error log from the FD, which records a segmentation violation during the data stream from the FD to the SD. This appears to be the result of a malformed response from Ceph's *perf_stats.py* on the storage side: the while loop in Bareos's *append.cc* does not seem to account for interruptions in the data stream caused by malformed responses, and the daemon dies with a segmentation violation. The Job then fails and has to be rescheduled and rerun in its entirety. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?
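For what it's worth, the mgr-side AttributeError shown in the log further down happens because FSPerfStats.mx_last_updated may not exist yet when re_register_queries runs. A guard along the following lines would presumably avoid that crash. This is only a simplified Python sketch to illustrate the idea, not the actual Ceph code or upstream fix; the class name FSPerfStatsSketch, the processed_report flag, and the return values are my own illustration:

```python
from threading import Lock


class FSPerfStatsSketch:
    """Toy stand-in for Ceph's FSPerfStats, only to illustrate the guard."""

    def __init__(self, processed_report=False):
        self.lock = Lock()
        if processed_report:
            # normally set once mgr/stats has processed an mds report;
            # deliberately absent otherwise, mimicking the crashing state
            self.mx_last_updated = 100.0

    def re_register_queries(self, rank0_gid, ua_last_updated):
        with self.lock:
            # Guard: default to 0 when no mds report has been processed yet,
            # instead of raising AttributeError like the unguarded comparison.
            if getattr(self, "mx_last_updated", 0) >= ua_last_updated:
                return "reregister"  # metrics are current: reregister queries
            return "reschedule"      # otherwise wait for a report and retry
```

With the guard in place, an early call simply falls into the reschedule branch rather than killing the thread.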
These are the log messages from the FD (STDOUT) during the Full backup Job referenced in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512

Below is the error message from the Ceph manager; pebbles01 is one of the storage servers in the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem.
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

This could be relevant to the issue: https://tracker.ceph.com/issues/65073. It can happen when FSPerfStats.re_register_queries is called before mgr/stats has processed a single mds report. These lines from perf_stats.py show how a malformed response could potentially be sent to Bareos:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # reregister queries if the metrics are the latest. Otherwise reschedule
    # the timer and wait for the empty metrics
    with self.lock:
        if self.mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL, self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()

- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:
> Hello,
>
> I manage and configure my organization's Bareos backup system, which
> backs up millions of files totaling ~350 TB of data from an NFS share mounted
> on the Bareos server. It stores the data in Volumes on a CephFS filesystem,
> also mounted on the Bareos server, and uses disk-based storage for the
> Volumes, unlike the tape library we used previously.
>
> Over the last several months, bareos-sd has been encountering recurring
> errors during Incremental and Full Jobs, in which the Jobs fail with a fatal
> SD error and a non-fatal FD error. We upgraded Bareos from v21 to v23 a
> month ago, but that hasn't resolved the errors. Below are the errors from
> one of the job logs:
>
> 2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
> 2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
> 2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
> 2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
> 2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
> 2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
> 2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
> 2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
> 2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
> 2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
> 2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
> 2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26
>
> Any assistance in troubleshooting this is greatly appreciated. I can
> provide any configurations and other info as necessary, minus any IP
> addresses or other confidential info.
>
> - Paul Simmons

--
You received this message because you are subscribed to the Google Groups "bareos-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/268a32a6-307e-44b5-a7cc-8a8baa094408n%40googlegroups.com.
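P.S. Regarding the SD-side "Error reading data header from FD. ERR=No data available" in the job log above: whatever the fix on the Ceph side, the read loop could in principle distinguish an interrupted or truncated stream from valid data instead of faulting. Below is a minimal Python sketch of that pattern, for discussion only; append.cc is C++, and the function name read_exact is my own, not a Bareos API.

```python
import socket


def read_exact(sock, n):
    """Read exactly n bytes from sock.

    Retries on EINTR (a signal interrupted the syscall) and raises a
    clean ConnectionError on a short read (the peer vanished mid-record),
    rather than handing a truncated header to the caller.
    """
    buf = b""
    while len(buf) < n:
        try:
            chunk = sock.recv(n - len(buf))
        except InterruptedError:
            continue  # EINTR: retry the read instead of treating it as EOF
        if not chunk:
            raise ConnectionError(
                f"stream truncated: got {len(buf)} of {n} header bytes")
        buf += chunk
    return buf
```

(On Python 3.5+ the interpreter already retries EINTR for socket calls per PEP 475, so the except branch is belt-and-braces there; the main point is the explicit short-read check.)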
