Re: [Bacula-devel] Bacula 9.0.2 testing

Kern Sibbald Sun, 30 Jul 2017 02:21:25 -0700

Hello Phil,

See below ...



On 07/29/2017 10:35 PM, Phil Stracchino wrote:

I've now got everything updated to 9.0.2 using a work-in-progress
development version of the app-backup/bacula-9.0.2 ebuild.  I'm running
against MariaDB 10.2.7 (which more or less approximates MySQL 5.7) with
Galera enabled.

Warning: MariaDB 10.2.7 has a serious bug that causes batch inserts to
occasionally fail.  I submitted a bug report, and they duplicated the
problem with a script and are working on fixing it.



Build platforms:
Gentoo Linux on amd64 (AMD Phenom II, Thuban microarchitecture) using
gcc 6.3.0
Solaris 10u9 on amd64 (Intel P4 Xeon, Nocona microarchitecture) using
Solaris Studio 12.2
Solaris 11.3 on amd64 (AMD Opteron 2384, Shanghai microarchitecture)
using gcc 4.9.4

Build considerations:  Solaris 10 required the tgoto prototype in
conio.c to be moved down one line.  No other build issues encountered
other than that enabling building the storage daemon also forces
enabling the director, even if director is requested to be disabled.


I did change the DB write batch size limit at sql_create.c:870 from
500000 to 1000 per Galera best-performance recommendations.  I was able
to complete incremental backups and some differential backups.  I was
able to successfully run jobs that backed up as many as 120,000 files,
with wsrep_max_rows at its default of 128K.  A differential job that
tried to back up 177,000 files failed with wsrep_max_rows_exceeded.  If
that is truly the only place in the code that the write batch size is
set, then it appears database write batching is not actually working.

I maintain that even without Galera, 500000 is an unreasonably large
batch size.  Just because a modern database *can* handle it doesn't make
it a good idea.  50000 would be more reasonable, and 10000 would be better.
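
For illustration only, the batching idea discussed above can be sketched
in Python.  This is not Bacula's code (which is C/C++); the helper name
`batched` and the numbers below are hypothetical and only show how a fixed
batch size keeps every write below a per-transaction row limit, however
many file records a job produces.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield batch


# Illustrative numbers: 177,000 file records written 1,000 rows at a time.
sizes = [len(b) for b in batched(range(177_000), 1_000)]
print(len(sizes), max(sizes))
```

If each batch were committed as its own transaction, no single
transaction would carry more than 1,000 rows, so a job's total file count
would not matter to a row limit of the kind described above.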

I had always thought the limit was 25,000, so I was a bit surprised
when I saw 500,000.  I suppose it is a sort of insane limit I added.
Whether it works or not, I don't know.  I have since found out why I
remembered 25,000: it is the maximum set for PostgreSQL.




Problems encountered so far, running the Director and both SDs in the
foreground at -d200:

1.  None of the datetime fields in the schema have defaults.  This is a
problem unless STRICT SQL mode is disabled, which is a bad idea.  It is
probable that in upcoming Oracle MySQL versions (and forks thereof),
strict SQL will be mandatory.

Adding the canonically-correct-SQL DEFAULT '1970-01-01 00:00:00' to all
datetime fields prevented any further DB-related outright *failures*.
However, this causes problems with Volume Use Duration settings.

Using DEFAULT '0000-00-00 00:00:00' for datetime is permitted by MySQL
5.7 or MariaDB 10.2.x *as long as* SQL_MODE does not include
NO_ZERO_DATE or NO_ZERO_IN_DATE.  This does not APPEAR to cause any
problems with volume expiration.
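
A minimal sketch of the SQL_MODE condition described above, in Python.
The function name is hypothetical, and it only looks for the two mode
names mentioned here; it assumes the mode string is the comma-separated
list the server reports and does not expand combination modes.

```python
def allows_zero_datetime_default(sql_mode: str) -> bool:
    """Return True if neither NO_ZERO_DATE nor NO_ZERO_IN_DATE is listed."""
    modes = {m.strip().upper() for m in sql_mode.split(",") if m.strip()}
    return not ({"NO_ZERO_DATE", "NO_ZERO_IN_DATE"} & modes)


# Strict mode alone does not forbid the zero-date default.
print(allows_zero_datetime_default("STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION"))
# Either of the two zero-date modes does.
print(allows_zero_datetime_default("STRICT_TRANS_TABLES,NO_ZERO_DATE"))
```

Feeding it the value of the server's sql_mode setting before applying the
schema would show whether the '0000-00-00 00:00:00' default can be used on
that server.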



2.  Various actions in BAT still create multiple overlapping and
often-confusing dialog boxes.  Deleting a volume, for example, emits a
confirmation dialog, followed by three more simultaneous dialogs:

- Warning:  This command will delete volume ... and all Jobs saved on
that volume from the Catalog
- Bat Question:  Are you sure you want to delete Volume ...? (yes/no)
- Text input dialog:  Are you sure you want to delete Volume ...? (yes/no)

You can't respond to the Warning until you respond to the Text Input
Dialog.  You can't respond to the Text Input Dialog until you respond to
the Bat Question.  If you type in the text input dialog's text input
box, it will throw an error.  You have to ignore the text box and click
OK instead.

However, this APPEARS to no longer cause BAT to become unresponsive.  I
have not yet tried a PURGE VOLUME, which is the other operation that
would in the past cause BAT to become unresponsive.



3.  I am having difficulty getting my LTO4 SD to mount and unmount tapes.

This is what the director logged when trying to run a restore from the
LTO4 tape SD with the wrong tape mounted:


29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:279 Read
acquire: Wrong Volume mounted on Tape device "LTO-4"
+(/dev/nst0): Wanted LTO4-FULL-0019 have LTO4-FULL-0013
29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=No medium found

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=No medium found

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=No medium found

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=No medium found

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Warning: acquire.c:235 Read open
Tape device "LTO-4" (/dev/nst0) Volume
+"LTO4-FULL-0019" failed: ERR=tape_dev.c:170 Unable to open device
"LTO-4" (/dev/nst0): ERR=Input/output error

29-Jul 13:13 babylon5-sd JobId 14248: Fatal error: acquire.c:328 Too
many errors trying to mount Tape device "LTO-4"
+(/dev/nst0) for reading.
29-Jul 13:13 babylon4 JobId 14248: Fatal error: job.c:2699 Bad response
from SD to Read Data command. Wanted 3000 OK data
, got len=11 msg="3000 error "


If I *START* the sd with the correct tape in place, it automounts it
just fine.  I was able to complete a test restore that required a single
tape by pre-loading the tape.  But I cannot manually mount or unmount
tapes, either from BAT or from the console.  It just plain doesn't work.
Nothing happens.  The SD doesn't log *anything* (at -d200) and as far
as I can tell, never receives the mount or umount commands.


status storage=babylon5-sd says about the device:

Device status:

Device Tape is "LTO-4" (/dev/nst0) mounted with:
     Volume:      LTO4-FULL-0019
     Pool:        *unknown*
     Media type:  LTO-4
     Total Bytes Read=0 Blocks Read=0 Bytes/block=0
     Positioned at File=0 Block=0
Configured device capabilities:
    EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT !LABEL !ANONVOLS
ALWAYSOPEN
Device state:
    OPENED TAPE LABEL !MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL
!SHORT !MOUNTED
    Writers=0 reserves=0 blocked=0 enabled=1 usage=1,024
Attached JobIds:
Device parameters:
    Archive name: /dev/nst0 Device name: LTO-4
    File=0 block=0
    Min block=0 Max block=2048000
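
To make the "Device state" line above easier to read, here is a small
Python sketch that splits it into set and unset flags.  It assumes, as the
output suggests, that a leading "!" marks a flag that is not set; the
helper name `parse_flags` is hypothetical and not part of Bacula.

```python
def parse_flags(text: str) -> dict:
    """Map each flag name to True (set) or False (prefixed with '!')."""
    return {tok.lstrip("!"): not tok.startswith("!") for tok in text.split()}


# The "Device state" flags copied from the status output above.
state = parse_flags(
    "OPENED TAPE LABEL !MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL "
    "!SHORT !MOUNTED"
)
print(sorted(name for name, is_set in state.items() if is_set))
print("MOUNTED set:", state["MOUNTED"])
```

Under that assumption, the device is reported as open and labeled but not
flagged MOUNTED, even though the header line says "mounted with".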


Do I need to re-test my tape drive under Bacula 9.x?
Has something changed between 7.4.7 and 9.x in tape handling that
requires configuration changes?


Summary:
- Can't run full backups because I can't mount and unmount LTO4 tapes
except by restarting the SD, which will cause the running jobs to fail
- Database write batching is not working, causing jobs that back up more
than 128K files to fail
- Schema is not compliant with MySQL 5.7 or MariaDB 10.2 with strict SQL
compliance enabled, which will cause many database-related failures

You have specified too many problems for me to deal with -- sorry.

I will say that the driver code for 9.0.x is totally rewritten from 7.4.x.
However the general high level code that mounts, unmounts, and
all that has not changed much.  Since it was such a massive rewrite,
there is a possibility of problems, but none of the rather extensive
regression tests shows problems, and the new code has been working
here on my (very simple) autochanger setup for at least a year.

I am not planning on working on Bat any more, but I do use it myself,
and I have noticed the annoying number of prompts to do something, but
it is not annoying enough to make me dig into the code.

If it will not do the few things I need, I will fix it. Otherwise, I am
trying to switch over to Baculum, but I have not quite succeeded in
making the change.

Most of the things you mention will probably ultimately end up being fixed
except perhaps bat, because Bacula Systems has a team of programmers
working on the problems that are reported to them, and obviously at some
point (or immediately if I notice it) I backport the fixes they make from
the Enterprise version to the Community version.  Obviously if the
problem strikes me, or it is clearly documented in a bug report, it has
a much higher probability of being fixed.


Best regards,
Kern


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
