Will do, thank you.

- Paul Simmons
On Thursday, August 8, 2024 at 12:43:54 PM UTC-7 Bruno Friedmann (bruno-at-bareos) wrote:

Please open an issue on github, will be the best place to track it down.

On Thursday 8 August 2024 at 20:06:08 UTC+2 Paul Simmons wrote:

Addendum: below is the error log taken from the FD. It reports a segmentation violation during the data stream from the FD to the SD, which appears to be the result of a malformed response from Ceph's perf_stats.py on the storage medium. The while statement in Bareos's append.cc does not seem to account for interruptions in the data stream caused by such malformed responses, and so an error indicating a segmentation violation is thrown. This causes the Job to fail and the entire Job to be rescheduled and rerun. This may be a bug in Bareos; should I move this ticket over to bug tracking on the Bareos GitHub page?

These are the log messages from the FD during the Full backup Job whose STDOUT I posted in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512
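To make clearer what I mean by a read loop that accounts for an interrupted or truncated stream, here is a rough sketch in Python. It is purely illustrative; it is not the actual stored/append.cc logic, which is C++ that I have not traced in detail:

import errno

def read_data_header(sock, header_len):
    # Hypothetical sketch, not Bareos code: read exactly header_len bytes
    # from a connected socket, tolerating interrupted reads and reporting
    # a stream that ends early instead of parsing a partial header.
    buf = bytearray()
    while len(buf) < header_len:
        try:
            chunk = sock.recv(header_len - len(buf))
        except InterruptedError:
            # A signal interrupted the read; retry rather than failing.
            continue
        if not chunk:
            # The peer stopped sending mid-header ("No data available").
            raise OSError(errno.ENODATA, "data stream ended before a full header arrived")
        buf += chunk
    return bytes(buf)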
Error message from the Ceph manager; pebbles01 is one of the storage servers within the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem:

Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

This could be relevant to this issue: https://tracker.ceph.com/issues/65073
This can happen when FSPerfStats.re_register_queries is called before mgr/stats can process a single mds report.

These lines from perf_stats.py show that a malformed response could potentially be sent to Bareos:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # reregister queries if the metrics are the latest. Otherwise reschedule the timer and
    # wait for the empty metrics
    with self.lock:
        if self.mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                 self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()
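If the trigger really is just that mx_last_updated has not been set by the time the timer fires, a guard along these lines would avoid the AttributeError. This is only a sketch of the idea for the GitHub issue, assuming the same FSPerfStats context as the snippet above; it is not the actual fix adopted upstream:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # Sketch only: treat a missing mx_last_updated as "no mds report
    # processed yet" and fall back to rescheduling instead of raising.
    with self.lock:
        mx_last_updated = getattr(self, 'mx_last_updated', None)
        if mx_last_updated is not None and mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer until at least one report has been processed
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                 self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()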
- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:

Hello,

I manage and configure my organization's Bareos backup system, which backs up millions of files totaling ~350 TB of data from an NFS mount on the Bareos server. It stores the data in Volumes on a CephFS that is also mounted on the Bareos server, so the Volumes sit on disk-based storage rather than the tape library we used previously.

Over the last several months, bareos-sd has been encountering recurring errors during Incremental and Full Jobs in which the Jobs fail with a fatal SD error and a non-fatal FD error. We upgraded Bareos from v.21 to v.23 a month ago, but that has not resolved the errors. Below are the errors from one of the joblogs:

2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26

Any assistance in troubleshooting this is greatly appreciated. I can provide any configurations and other info as necessary, minus any IP addresses or other confidential info.

- Paul Simmons
