Will do, thank you.

- Paul Simmons
On Thursday, August 8, 2024 at 12:43:54 PM UTC-7 Bruno Friedmann (bruno-at-bareos) wrote:

Please open an issue on github, will be the best place to track it down.

On Thursday 8 August 2024 at 20:06:08 UTC+2 Paul Simmons wrote:

Addendum: below is the error log taken from the FD. It reports a segmentation violation during the data stream from the FD to the SD, which appears to be the result of a malformed response from Ceph's perf_stats.py on the storage medium. The while statement in Bareos's append.cc does not seem to account for interruptions in the data stream caused by such malformed responses, and so an error indicating a segmentation violation is thrown. This causes the Job to fail and the entire Job to be rescheduled and rerun. This may be a bug in Bareos; should I move this ticket over to bug tracking on the Bareos GitHub page?

These are the log messages from the FD during the Full backup Job whose STDOUT I posted in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512
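To make clearer what I mean by a read loop that accounts for an interrupted or truncated stream, here is a rough sketch in Python. It is purely illustrative; it is not the actual stored/append.cc logic, which is C++ that I have not traced in detail:

import errno

def read_data_header(sock, header_len):
    # Hypothetical sketch, not Bareos code: read exactly header_len bytes
    # from a connected socket, tolerating interrupted reads and reporting
    # a stream that ends early instead of parsing a partial header.
    buf = bytearray()
    while len(buf) < header_len:
        try:
            chunk = sock.recv(header_len - len(buf))
        except InterruptedError:
            # A signal interrupted the read; retry rather than failing.
            continue
        if not chunk:
            # The peer stopped sending mid-header ("No data available").
            raise OSError(errno.ENODATA, "data stream ended before a full header arrived")
        buf += chunk
    return bytes(buf)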
Error message from the Ceph manager; pebbles01 is one of the storage servers within the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem:

Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

This could be relevant to this issue: https://tracker.ceph.com/issues/65073
This can happen when FSPerfStats.re_register_queries is called before mgr/stats can process a single mds report.

These lines from perf_stats.py show that a malformed response could potentially be sent to Bareos:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # reregister queries if the metrics are the latest. Otherwise reschedule the timer and
    # wait for the empty metrics
    with self.lock:
        if self.mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                 self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()
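If the trigger really is just that mx_last_updated has not been set by the time the timer fires, a guard along these lines would avoid the AttributeError. This is only a sketch of the idea for the GitHub issue, assuming the same FSPerfStats context as the snippet above; it is not the actual fix adopted upstream:

def re_register_queries(self, rank0_gid, ua_last_updated):
    # Sketch only: treat a missing mx_last_updated as "no mds report
    # processed yet" and fall back to rescheduling instead of raising.
    with self.lock:
        mx_last_updated = getattr(self, 'mx_last_updated', None)
        if mx_last_updated is not None and mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer until at least one report has been processed
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                 self.re_register_queries,
                                 args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()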
- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:

Hello,

I manage and configure my organization's Bareos backup system, which backs up millions of files totaling ~350 TB of data from an NFS mount on the Bareos server. It stores the data in Volumes on a CephFS that is also mounted on the Bareos server, so the Volumes sit on disk-based storage rather than the tape library we used previously.

Over the last several months, bareos-sd has been encountering recurring errors during Incremental and Full Jobs in which the Jobs fail with a fatal SD error and a non-fatal FD error. We upgraded Bareos from v.21 to v.23 a month ago, but that has not resolved the errors. Below are the errors from one of the joblogs:

2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26

Any assistance in troubleshooting this is greatly appreciated. I can provide any configurations and other info as necessary, minus any IP addresses or other confidential info.

- Paul Simmons
