Yes, the following is set on the pool to keep the volumes manageable:

  Maximum Volume Bytes = 653197960000
  Maximum Volumes = 18

There are three 3.7 TB disks in that pool, I reckon this should keep the disks from completely filling up.

Marko Bartolić, Nimium d.o.o., email: <marko.barto...@nimium.hr>
Mob: +385 99 325 7180
Zagrebačka cesta 145A/I, 10 000 Zagreb, Hrvatska

On 22. 10. 2021. 17:06, Martin Simmons wrote:
Ah, so the short block is part of jobid 12065 and therefore the copy of jobid
12068 is very likely to be complete.

The "[SI0202] End of Volume" message means that write() wrote less that
expected.  The only reasons I can think of are disk space shortage, I/O error,
some other process-specific limit (such as controlled by ulimit -f) or buggy
filesystem.

Have you set some limit on the volume size?  It is interesting that those 3
volumes are all just slightly smaller that 653,198,000,000 bytes.

__Martin


On Fri, 22 Oct 2021 13:53:12 +0200, Marko Bartolić said:
Hi,

the backup job on 03-Oct definetly filled the volume 10, but not the
whole disk, it still has some free space.

*list volume

Automatically selected Catalog: MyCatalog

Using Catalog "MyCatalog"

Pool: Default

+---------+------------+-----------+---------+-----------------+----------+--------------+---------+------+-----------+-----------+---------+----------+---------------------+------------+

| mediaid | volumename | volstatus | enabled | volbytes        | volfiles | 
volretention | recycle | slot | inchanger | mediatype | voltype | volparts | 
lastwritten         | expiresin  |

+---------+------------+-----------+---------+-----------------+----------+--------------+---------+------+-----------+-----------+---------+----------+---------------------+------------+

...

|       9 | volume-09  | Full      |       1 | 653,197,959,321 |      152 |   
15,552,000 |       1 |    0 |         0 | data1     |       1 |        0 | 
2021-09-06 23:06:04 | 11,615,551 |
|      10 | volume-10  | Full      |       1 | 653,197,888,553 |      152 |   
15,552,000 |       1 |    0 |         0 | data1     |       1 |        0 | 
2021-10-03 23:09:09 | 13,944,936 |
|      11 | volume-11  | Full      |       1 | 653,197,958,518 |      152 |   
15,552,000 |       1 |    0 |         0 | data1     |       1 |        0 | 
2021-10-05 13:11:17 | 14,081,864 |

...



/var/log/messages or dmesg do not show any problems on 03-Oct.

Yes, that is the complete output for jobid 12068 from bacula.log. But
there was another job running at approximately same time during which
the volume changed from 10 to 11:

03-Oct 23:09 backup-sd JobId 12065: [SI0202] End of Volume "volume-10" at 653197953065 on 
device "backup-device-data1" (/data1). Write of 64512 bytes got 42967.

03-Oct 23:09 backup-sd JobId 12065: End of medium on Volume "volume-10" 
Bytes=653,197,888,553 Blocks=10,125,619 at 03-Oct-2021 23:09.

03-Oct 23:09 backup-dir JobId 12065: There are no more Jobs associated with Volume 
"volume-11". Marking it purged.

03-Oct 23:09 backup-dir JobId 12065: All records pruned from Volume "volume-11"; marking 
it "Purged"

03-Oct 23:09 backup-dir JobId 12065: Recycled volume "volume-11"

03-Oct 23:09 backup-sd JobId 12065: Recycled volume "volume-11" on File device 
"backup-device-data1" (/data1), all previous data lost.

03-Oct 23:09 backup-sd JobId 12065: New volume "volume-11" mounted on device 
"backup-device-data1" (/data1) at 03-Oct-2021 23:09.



Marko Bartolić, Nimium d.o.o., email:<marko.barto...@nimium.hr>

On 22. 10. 2021. 13:18, Martin Simmons wrote:
Did the backup job on 03-Oct fill the disk?  Did you check for syslog messages
for that date too?

Also, did you include the complete output below from the backup jobid 12068?
I would expect more information to be printed between 03-Oct 23:06 and 03-Oct
23:19 about the volumes being used and the reason for changing from volume-10
to volume-11.

__Martin


On Thu, 21 Oct 2021 16:27:13 +0200, Marko Bartolić said:
Hello,

I hope this is the right place to ask.

We have Bacula version 9.6.5. on a RHEL 8.2 physical server. We use copy
job that runs every 6 months and copies the last full backup of 22
clients to a separate SCSI disk which is then pulled out from the server
for offsite backup. So both the backup disks and disk on which the
backups are later copied are local, physical disks in the machine. This
was working fine every time up to now. During last run of the copy job,
we noticed the following error on one of the copy jobs:

21-Oct 04:00 backup-dir JobId 12612: Copying using JobId=12068 
Job=websrv.2021-10-03_23.05.00_12
21-Oct 05:45 backup-dir JobId 12612: Start Copying JobId 12612, 
Job=Copy-to-offsite-disk.2021-10-21_04.00.01_36
21-Oct 05:45 backup-dir JobId 12612: Using Device "backup-device-data1" to read.
21-Oct 05:45 backup-sd JobId 12612: Ready to read from volume "volume-10" on File device 
"backup-device-data1" (/data1).
21-Oct 05:45 backup-sd JobId 12612: Forward spacing Volume "volume-10" to 
addr=641829209227
*21-Oct 05:46 backup-sd JobId 12612: Warning: block.c:682 [SE0208] Volume data error at 
0:0! Short block of 42967 bytes on device "backup-device-data1" (/data1) 
discarded. **
**21-Oct 05:46 backup-sd JobId 12612: Error: read_records.c:160 block.c:682 [SE0208] 
Volume data error at 0:0! Short block of 42967 bytes on device 
"backup-device-data1" (/data1) discarded. *
21-Oct 05:46 backup-sd JobId 12612: End of Volume "volume-10" at addr=653197996032 on 
device "backup-device-data1" (/data1).
21-Oct 05:46 backup-sd JobId 12612: Ready to read from volume "volume-11" on File device 
"backup-device-data1" (/data1).
21-Oct 05:46 backup-sd JobId 12612: Forward spacing Volume "volume-11" to 
addr=64742
21-Oct 05:50 backup-sd JobId 12612: End of Volume "volume-11" at addr=49666111453 on 
device "backup-device-data1" (/data1).
21-Oct 05:50 backup-sd JobId 12612: Elapsed time=00:05:02, Transfer rate=16.65 
M Bytes/second
21-Oct 05:50 backup-dir JobId 12612: Error: Bacula backup-dir 9.6.5 (11Jun20):
    Build OS:               x86_64-redhat-linux-gnu-bacula redhat (Core)
    Prev Backup JobId:      12068
    Prev Backup Job:        websrv.2021-10-03_23.05.00_12
    New Backup JobId:       12613
    Current JobId:          12612
    Current Job:            Copy-to-offsite-disk.2021-10-21_04.00.01_36
    Backup Level:           Full
    Client:                 backup-fd
    FileSet:                "backup" 2020-09-17 21:53:08
    Read Pool:              "Default" (From Command input)
    Read Storage:           "backup-storage-data1" (From Pool resource)
    Write Pool:             "Offsite" (From Command input)
    Write Storage:          "backup-storage-data2" (From Command input)
    Catalog:                "MyCatalog" (From Client resource)
    Start time:             21-Oct-2021 05:45:32
    End time:               21-Oct-2021 05:50:42
    Elapsed time:           5 mins 10 secs
    Priority:               10
* SD Files Written:       200,038 **
** SD Bytes Written:       5,029,190,722 (5.029 GB) *
    Rate:                   16223.2 KB/s
    Volume name(s):         vol-01
    Volume Session Id:      5353
    Volume Session Time:    1619097903
    Last Volume Bytes:      573,628,844,941 (573.6 GB)
    SD Errors:              1
    SD termination status:  OK
    Termination:            *** Copying Error ***



Job ends with error, but upon checking the backup size and number of
files of the original "websrv" backup job (which was successful), we see
that the number of files and bytes written are the same as is in the
copy job:

03-Oct 23:06 backup-dir JobId 12068: Start Backup JobId 12068, 
Job=websrv.2021-10-03_23.05.00_12
03-Oct 23:06 backup-dir JobId 12068: Using Device "backup-device-data1" to 
write.
03-Oct 23:06 websrv-fd JobId 12068: shell command: run ClientBeforeJob 
"/etc/bacula/scripts/mariadb-backup.sh"
03-Oct 23:10 websrv-fd JobId 12068:      /srv/cpap is a different filesystem. 
Will not descend from / into it.
03-Oct 23:19 backup-sd JobId 12068: Elapsed time=00:13:26, Transfer rate=6.239 
M Bytes/second
03-Oct 23:19 backup-sd JobId 12068: Sending spooled attrs to the Director. 
Despooling 53,043,158 bytes ...
03-Oct 23:19 backup-dir JobId 12068: Bacula backup-dir 9.6.5 (11Jun20):
     Build OS:               x86_64-redhat-linux-gnu-bacula redhat (Core)
     JobId:                  12068
     Job:                    websrv.2021-10-03_23.05.00_12
     Backup Level:           Full
     Client:                 "websrv-fd" 9.0.6 (20Nov17) 
x86_64-redhat-linux-gnu,redhat,Enterprise 8.1
     FileSet:                "websrv" 2020-09-17 23:05:00
     Pool:                   "Default" (From Job resource)
     Catalog:                "MyCatalog" (From Client resource)
     Storage:                "backup-storage-data1" (From Pool resource)
     Scheduled time:         03-Oct-2021 23:05:00
     Start time:             03-Oct-2021 23:06:04
     End time:               03-Oct-2021 23:19:37
     Elapsed time:           13 mins 33 secs
     Priority:               10
*FD Files Written: 200,038 SD Files Written: 200,038 FD Bytes Written:
4,996,564,202 (4.996 GB) SD Bytes Written: 5,029,190,722 (5.029 GB)*
     Rate:                   6145.8 KB/s
     Software Compression:   40.9% 1.7:1
     Comm Line Compression:  16.7% 1.2:1
     Snapshot/VSS:           no
     Encryption:             no
     Accurate:               no
     Volume name(s):         volume-10|volume-11
     Volume Session Id:      4812
     Volume Session Time:    1619097903
     Last Volume Bytes:      49,996,799,959 (49.99 GB)
     Non-fatal FD errors:    0
     SD Errors:              0
     FD termination status:  OK
     SD termination status:  OK
     Termination:            Backup OK

03-Oct 23:19 backup-dir JobId 12068: Begin pruning Jobs older than 6 months .
03-Oct 23:19 backup-dir JobId 12068: Pruned 1 Job for client websrv-fd from 
catalog.
03-Oct 23:19 backup-dir JobId 12068: Begin pruning Files.
03-Oct 23:19 backup-dir JobId 12068: Pruned Files from 1 Jobs for client 
websrv-fd from catalog.
03-Oct 23:19 backup-dir JobId 12068: End auto prune.


I also don't see any errors in messages or dmesg output.

   From what I gathered from an older thread:
https://marc.info/?l=bacula-devel&m=123237957307571&w=2
<https://marc.info/?l=bacula-devel&m=123237957307571&w=2>  , this points
that copying completed successfully. Also mentioned in the thread is
that "Maximium Block Size" could be the problem, but we haven't defined
that property.

Does anyone know what could cause this?


BR

--
Marko Bartolić, Nimium d.o.o., email:<marko.barto...@nimium.hr>





_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Reply via email to