Addendum: below is the error log from the FD, which records a segmentation violation during the data stream from the FD to the SD. This appears to be the result of a malformed response from Ceph's *perf_stats.py* on the storage side: the while loop in Bareos's *append.cc* does not seem to account for interruptions in the data stream caused by malformed responses, and the daemon dies with a segmentation violation. The Job then fails and has to be rescheduled and rerun in its entirety. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?
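For what it's worth, the mgr-side AttributeError shown in the log further down happens because FSPerfStats.mx_last_updated may not exist yet when re_register_queries runs. A guard along the following lines would presumably avoid that crash. This is only a simplified Python sketch to illustrate the idea, not the actual Ceph code or upstream fix; the class name FSPerfStatsSketch, the processed_report flag, and the return values are my own illustration:

```python
from threading import Lock


class FSPerfStatsSketch:
    """Toy stand-in for Ceph's FSPerfStats, only to illustrate the guard."""

    def __init__(self, processed_report=False):
        self.lock = Lock()
        if processed_report:
            # normally set once mgr/stats has processed an mds report;
            # deliberately absent otherwise, mimicking the crashing state
            self.mx_last_updated = 100.0

    def re_register_queries(self, rank0_gid, ua_last_updated):
        with self.lock:
            # Guard: default to 0 when no mds report has been processed yet,
            # instead of raising AttributeError like the unguarded comparison.
            if getattr(self, "mx_last_updated", 0) >= ua_last_updated:
                return "reregister"  # metrics are current: reregister queries
            return "reschedule"      # otherwise wait for a report and retry
```

With the guard in place, an early call simply falls into the reschedule branch rather than killing the thread.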
These are the log messages from the FD (STDOUT) during the Full backup Job referenced in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512

Below is the error message from the Ceph manager; pebbles01 is one of the storage servers in the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem.
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

This could be relevant to the issue: https://tracker.ceph.com/issues/65073. It can happen when FSPerfStats.re_register_queries is called before mgr/stats has processed a single mds report. These lines from perf_stats.py show how a malformed response could potentially be sent to Bareos:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # reregister queries if the metrics are the latest. Otherwise reschedule
    # the timer and wait for the empty metrics
    with self.lock:
        if self.mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL, self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()

- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:
> Hello,
>
> I manage and configure my organization's Bareos backup system, which
> backs up millions of files totaling ~350 TB of data from an NFS share mounted
> on the Bareos server. It stores the data in Volumes on a CephFS filesystem,
> also mounted on the Bareos server, and uses disk-based storage for the
> Volumes, unlike the tape library we used previously.
>
> Over the last several months, bareos-sd has been encountering recurring
> errors during Incremental and Full Jobs, in which the Jobs fail with a fatal
> SD error and a non-fatal FD error. We upgraded Bareos from v21 to v23 a
> month ago, but that hasn't resolved the errors. Below are the errors from
> one of the job logs:
>
> 2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
> 2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
> 2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
> 2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
> 2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
> 2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
> 2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
> 2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
> 2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
> 2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
> 2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
> 2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
> 2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26
>
> Any assistance in troubleshooting this is greatly appreciated. I can
> provide any configurations and other info as necessary, minus any IP
> addresses or other confidential info.
>
> - Paul Simmons

--
You received this message because you are subscribed to the Google Groups "bareos-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/268a32a6-307e-44b5-a7cc-8a8baa094408n%40googlegroups.com.
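P.S. Regarding the SD-side "Error reading data header from FD. ERR=No data available" in the job log above: whatever the fix on the Ceph side, the read loop could in principle distinguish an interrupted or truncated stream from valid data instead of faulting. Below is a minimal Python sketch of that pattern, for discussion only; append.cc is C++, and the function name read_exact is my own, not a Bareos API.

```python
import socket


def read_exact(sock, n):
    """Read exactly n bytes from sock.

    Retries on EINTR (a signal interrupted the syscall) and raises a
    clean ConnectionError on a short read (the peer vanished mid-record),
    rather than handing a truncated header to the caller.
    """
    buf = b""
    while len(buf) < n:
        try:
            chunk = sock.recv(n - len(buf))
        except InterruptedError:
            continue  # EINTR: retry the read instead of treating it as EOF
        if not chunk:
            raise ConnectionError(
                f"stream truncated: got {len(buf)} of {n} header bytes")
        buf += chunk
    return buf
```

(On Python 3.5+ the interpreter already retries EINTR for socket calls per PEP 475, so the except branch is belt-and-braces there; the main point is the explicit short-read check.)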
