On 1/15/21 11:23 AM, Phil Stracchino wrote:
> On 1/15/21 11:11 AM, Phil Stracchino wrote:
>> On 1/15/21 9:54 AM, Phil Stracchino wrote:
>>> I will TRY to find time to look at the code and see whether I can patch
>>> it myself and test.  I might get time today or this weekend.
>>
>> Actually, I managed to fit in a look right now.  I've now rebuilt with
>> the following patch:
>
>
> ...however it's just occurred to me that I probably actually patched the
> wrong place, looking at it in too much of a hurry.  I'm going to have to
> recheck this.
On a second look, no, I think I was in the right place, but sql_cmds.c
should also be patched.  However, the current query in sql_cmds.c:

const char *batch_fill_path_query[] = {
   /* Mysql */
   "INSERT INTO Path (Path) "
      "SELECT a.Path FROM "
      "(SELECT DISTINCT Path FROM batch) AS a WHERE NOT EXISTS "
      "(SELECT Path FROM Path AS p WHERE p.Path = a.Path)",

is far more complex than it needs to be.  Frankly, this is a really
badly written query.  Even keeping the simple INSERT, the
'SELECT a.Path FROM' line is utterly redundant and serves no purpose
whatsoever.  (In fact, the query optimizer detects this and optimizes
that SELECT out completely.)  Even replacing it with:

   "INSERT IGNORE INTO Path (Path) "
      "SELECT DISTINCT Path FROM batch",

would probably perform a lot better, as it eliminates TWO nested
dependent subqueries.  A far better way to perform the insert would be:

INSERT IGNORE INTO Path (Path)
   SELECT a.Path FROM batch a
   LEFT JOIN Path p ON a.Path = p.Path
   WHERE ISNULL(p.Path);

This both filters already-existing Paths out of the batch set much more
efficiently, AND does not fail if another job inserts those same Paths
first.  If you compare the execution plans for the two queries, you will
see that the second performs ENORMOUSLY better.  The same applies to the
following batch_fill_filename_query[].
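To make the anti-join pattern concrete, here is a minimal, self-contained sketch.  It uses SQLite rather than MySQL (so INSERT OR IGNORE stands in for INSERT IGNORE, and "p.Path IS NULL" for ISNULL(p.Path)); the table names mirror Bacula's Path and batch tables, but the data is made up for illustration:

```python
# Sketch of the LEFT JOIN anti-join insert described above (SQLite syntax;
# MySQL would use INSERT IGNORE and ISNULL(p.Path)).  Table names mirror
# Bacula's Path/batch tables; the sample rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Path (Path TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE batch (Path TEXT)")

# Pre-existing paths, plus a batch containing both duplicates and new rows.
cur.executemany("INSERT INTO Path VALUES (?)", [("/etc",), ("/usr",)])
cur.executemany("INSERT INTO batch VALUES (?)",
                [("/etc",), ("/usr",), ("/var",), ("/var",), ("/home",)])

# Anti-join: only batch rows with no match in Path survive the WHERE,
# so already-existing paths never reach the INSERT at all.  OR IGNORE
# additionally absorbs duplicates within the batch itself, and rows
# another job races in first.
cur.execute(
    "INSERT OR IGNORE INTO Path (Path) "
    "SELECT a.Path FROM batch a "
    "LEFT JOIN Path p ON a.Path = p.Path "
    "WHERE p.Path IS NULL"
)
conn.commit()

paths = sorted(r[0] for r in cur.execute("SELECT Path FROM Path"))
print(paths)
```

Running this leaves Path containing each distinct path exactly once, whether it pre-existed or arrived via the batch.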
This is the patch for that change (note the trailing space inside each
string literal, so the concatenated fragments don't run together):

--- src/cats/sql_cmds.c.orig	2020-12-10 08:26:39.000000000 -0500
+++ src/cats/sql_cmds.c	2021-01-15 12:04:40.065125913 -0500
@@ -895,15 +895,16 @@
    "COMMIT"
 };

 const char *batch_fill_path_query[] = {
    /* Mysql */
-   "INSERT INTO Path (Path) "
-      "SELECT a.Path FROM "
-      "(SELECT DISTINCT Path FROM batch) AS a WHERE NOT EXISTS "
-      "(SELECT Path FROM Path AS p WHERE p.Path = a.Path)",
-
+   "INSERT IGNORE INTO Path (Path) "
+      "SELECT a.Path FROM batch a "
+      "LEFT JOIN Path p ON a.Path = p.Path "
+      "WHERE ISNULL(p.Path)",
+
    /* PostgreSQL */
    "INSERT INTO Path (Path) "
       "SELECT a.Path FROM "
       "(SELECT DISTINCT Path FROM batch) AS a "
       "WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path) ",
@@ -914,14 +915,14 @@
    "EXCEPT SELECT Path FROM Path"
 };

 const char *batch_fill_filename_query[] = {
    /* Mysql */
-   "INSERT INTO Filename (Name) "
-      "SELECT a.Name FROM "
-      "(SELECT DISTINCT Name FROM batch) AS a WHERE NOT EXISTS "
-      "(SELECT Name FROM Filename AS f WHERE f.Name = a.Name)",
+   "INSERT IGNORE INTO Filename (Name) "
+      "SELECT a.Name FROM batch a "
+      "LEFT JOIN Filename f ON a.Name = f.Name "
+      "WHERE ISNULL(f.Name)",

    /* Postgresql */
    "INSERT INTO Filename (Name) "
       "SELECT a.Name FROM "
       "(SELECT DISTINCT Name FROM batch) as a "

However, I CANNOT test this myself, because I cannot use batch mode:
batch mode as currently implemented is not compatible with Galera
clusters.  This is because it collects all of the database transactions
from each job into a temporary table in the database, then dumps the
entire batch table into the main database table in a single massive
belch, which means any job that backs up more than 128K files using a
Galera cluster will always fail.  (Galera 3 cluster replication has a
hard limit on writesets of 128K rows or 4 GB of total data length,
whichever comes first.)
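One way around that writeset cap is the bounded-flush approach: accumulate rows in memory and commit one small transaction per chunk, so no single writeset ever approaches the 128K-row limit.  A hypothetical sketch (SQLite again as a portable stand-in; the chunk size and table name are assumptions, not Bacula parameters):

```python
# Hypothetical sketch of bounded flushing: buffer rows in memory and
# commit one multi-row insert per chunk, keeping every transaction far
# below Galera 3's 128K-row / 4 GB writeset cap.  CHUNK and the Path
# table are illustrative assumptions, not actual Bacula configuration.
import sqlite3

CHUNK = 1000  # rows per flush; chosen arbitrarily for the sketch

def flush_paths(conn, cache):
    """Write the buffered paths in one small transaction, then empty the buffer."""
    conn.executemany("INSERT OR IGNORE INTO Path (Path) VALUES (?)",
                     [(p,) for p in cache])
    conn.commit()
    cache.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path (Path TEXT PRIMARY KEY)")

cache = []
for i in range(2500):             # simulated stream of file paths from a job
    cache.append(f"/dir{i}")
    if len(cache) >= CHUNK:
        flush_paths(conn, cache)  # each flush is its own small writeset
if cache:
    flush_paths(conn, cache)      # final partial chunk

count = conn.execute("SELECT COUNT(*) FROM Path").fetchone()[0]
print(count)
```

The point of the sketch is only that transaction size is bounded by the in-memory buffer, not by the total size of the job.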
I'm frankly rather skeptical of the benefits of batch mode on MySQL
anyway, since as far as I can see its principal impact is that every
piece of data has to be written into the database TWICE: once into the
batch table, then a second time to copy from the batch table into the
main tables.  I would think that a much better way to do this would be
to cache, say, 1000 records at a time in memory, then flush that memory
cache into the database.  You can't reduce database traffic by caching
the database traffic in the database; that only INCREASES database
traffic.

I have run both with batch mode (before I clustered my database) and
without, and frankly I saw no visible overall performance improvement
from batching.


-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel