On 1/15/21 11:23 AM, Phil Stracchino wrote:
> On 1/15/21 11:11 AM, Phil Stracchino wrote:
>> On 1/15/21 9:54 AM, Phil Stracchino wrote:
>>> I will TRY to find time to look at the code and see whether I can patch
>>> it myself and test.  I might get time today or this weekend.
>>
>> Actually, I managed to fit in a look right now.  I've now rebuilt with
>> the following patch:
> 
> 
> ...however it's just occurred to me that I probably actually patched the
> wrong place looking at it in too much of a hurry.  I'm going to have to
> recheck this.


On a second look, no, I think I was in the right place, but sql_cmds.c
should also be patched.  However, the current query in sql_cmds.c:

const char *batch_fill_path_query[] = {
   /* Mysql */
   "INSERT INTO Path (Path) "
      "SELECT a.Path FROM "
         "(SELECT DISTINCT Path FROM batch) AS a WHERE NOT EXISTS "
         "(SELECT Path FROM Path AS p WHERE p.Path = a.Path)",

is far more complex than it needs to be.  Frankly, this is a badly
written query.  Even keeping the plain INSERT, the outer 'SELECT a.Path
FROM' wrapper is utterly redundant and serves no purpose whatsoever.
(In fact, the query optimizer detects this and optimizes that SELECT
out completely.)

Even replacing this with:

   "INSERT IGNORE INTO Path (Path) "
      "SELECT DISTINCT Path FROM batch",

would probably perform a lot better as it eliminates TWO nested
dependent subqueries.

A far better way to perform the insert would be:

INSERT IGNORE INTO Path (Path)
    SELECT a.Path FROM batch a
        LEFT JOIN Path p ON a.Path = p.Path
        WHERE ISNULL(p.Path);

This both filters already-existing Paths out of the batch set much more
efficiently, AND does not fail if another job inserts those same Paths
first.  If you compare the execution plans for the two queries you will
see that the second performs ENORMOUSLY better.
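For anyone who wants to experiment with the anti-join pattern without a
MySQL server handy, here is a rough sketch using Python's sqlite3 module.
Note the dialect differences: SQLite spells MySQL's INSERT IGNORE as
INSERT OR IGNORE and ISNULL(x) as x IS NULL, and the toy schema below is
a stand-in for Bacula's real tables, not the actual DDL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Path (PathId INTEGER PRIMARY KEY, Path TEXT UNIQUE);
    CREATE TABLE batch (Path TEXT);
""")
# One path already exists in the catalog; the batch repeats it,
# duplicates another path, and adds two new ones.
con.execute("INSERT INTO Path (Path) VALUES ('/etc/')")
con.executemany("INSERT INTO batch (Path) VALUES (?)",
                [("/etc/",), ("/usr/bin/",), ("/usr/bin/",), ("/var/log/",)])

# Anti-join insert: only batch paths with no match in Path survive the
# LEFT JOIN filter.  INSERT OR IGNORE (SQLite's spelling of MySQL's
# INSERT IGNORE) then absorbs duplicates within the batch itself, and
# would likewise absorb rows raced in by a concurrent job.
con.execute("""
    INSERT OR IGNORE INTO Path (Path)
        SELECT a.Path FROM batch a
            LEFT JOIN Path p ON a.Path = p.Path
            WHERE p.Path IS NULL
""")
rows = sorted(r[0] for r in con.execute("SELECT Path FROM Path"))
print(rows)  # ['/etc/', '/usr/bin/', '/var/log/']
```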


The same applies to the following batch_fill_filename_query[].

This is the patch for that change:


--- src/cats/sql_cmds.c.orig    2020-12-10 08:26:39.000000000 -0500
+++ src/cats/sql_cmds.c 2021-01-15 12:04:40.065125913 -0500
@@ -895,15 +895,16 @@
    "COMMIT"
 };

 const char *batch_fill_path_query[] = {
    /* Mysql */
-   "INSERT INTO Path (Path) "
-      "SELECT a.Path FROM "
-         "(SELECT DISTINCT Path FROM batch) AS a WHERE NOT EXISTS "
-         "(SELECT Path FROM Path AS p WHERE p.Path = a.Path)",
-
+
+   "INSERT IGNORE INTO Path (Path) "
+        "SELECT a.Path FROM batch a "
+           "LEFT JOIN Path p ON a.Path = p.Path "
+           "WHERE ISNULL(p.Path)",
+
    /* PostgreSQL */
    "INSERT INTO Path (Path)"
       "SELECT a.Path FROM "
          "(SELECT DISTINCT Path FROM batch) AS a "
        "WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path) ",
@@ -914,14 +915,14 @@
       "EXCEPT SELECT Path FROM Path"
 };

 const char *batch_fill_filename_query[] = {
    /* Mysql */
-   "INSERT INTO Filename (Name) "
-      "SELECT a.Name FROM "
-         "(SELECT DISTINCT Name FROM batch) AS a WHERE NOT EXISTS "
-         "(SELECT Name FROM Filename AS f WHERE f.Name = a.Name)",
+   "INSERT IGNORE INTO Filename (Name) "
+        "SELECT a.Name FROM batch a "
+           "LEFT JOIN Filename f ON a.Name = f.Name "
+           "WHERE ISNULL(f.Name)",

    /* Postgresql */
    "INSERT INTO Filename (Name) "
       "SELECT a.Name FROM "
          "(SELECT DISTINCT Name FROM batch) as a "



However, I CANNOT test this myself because I cannot use batch mode,
because batch mode as currently implemented is not compatible with
Galera clusters.  This is because it collects all of the database
transactions from each job into a temporary table in the database, then
dumps the entire batch table into the main database table in a single
massive belch, which means any job that backs up more than 128K files
using a Galera cluster will always fail.  (Galera 3 cluster replication
has a hard limit on writesets of 128K rows or 4GB total data length,
whichever comes first.)
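One conceivable workaround, if batch mode is kept at all, would be to
drain the batch table in bounded chunks, each committed separately, so
that no single transaction replicates more rows than the cluster allows.
Here is a rough Python/SQLite sketch of the idea; the chunk size, column
names, and schema are invented for illustration and shrunk far below the
real 128K-row limit:

```python
import sqlite3

CHUNK = 1000  # in production this would sit well under Galera 3's 128K-row cap

con = sqlite3.connect(":memory:")
con.isolation_level = None  # manage transactions explicitly
con.executescript("""
    CREATE TABLE Path (Path TEXT UNIQUE);
    CREATE TABLE batch (rowid_key INTEGER PRIMARY KEY, Path TEXT);
""")
con.executemany("INSERT INTO batch (Path) VALUES (?)",
                [(f"/dir{i}/",) for i in range(3500)])

chunks = 0
while True:
    con.execute("BEGIN")
    # Move one chunk, then delete it from batch; the COMMIT closes the
    # transaction, so each replicated writeset stays bounded.
    rows = con.execute("SELECT rowid_key, Path FROM batch "
                       "ORDER BY rowid_key LIMIT ?", (CHUNK,)).fetchall()
    if not rows:
        con.execute("COMMIT")
        break
    con.executemany("INSERT OR IGNORE INTO Path (Path) VALUES (?)",
                    [(p,) for _, p in rows])
    con.execute("DELETE FROM batch WHERE rowid_key <= ?", (rows[-1][0],))
    con.execute("COMMIT")
    chunks += 1

total = con.execute("SELECT COUNT(*) FROM Path").fetchone()[0]
print(chunks, total)  # 4 3500
```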


I'm frankly rather skeptical of the benefits of batch mode on MySQL
anyway, since as far as I can see its principal effect is that every
piece of data has to be written into the database TWICE.  (Once into the
batch table, then a second time to copy from the batch table into the
main tables.)  I would think that a much better way to do this would be
to cache, say, 1000 records at a time in memory, then flush that memory
cache into the database.  You can't reduce database traffic by caching
the database traffic in the database; that only INCREASES database traffic.
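To make that suggestion concrete, here is a rough sketch of such a
client-side cache, again in Python/SQLite purely for illustration (the
AttrCache name and the flush threshold are mine, not anything in Bacula):

```python
import sqlite3

FLUSH_EVERY = 1000  # the in-memory cache size suggested above; tune to taste

class AttrCache:  # hypothetical helper, not Bacula code
    def __init__(self, con):
        self.con = con
        self.pending = []

    def add(self, path):
        self.pending.append((path,))
        if len(self.pending) >= FLUSH_EVERY:
            self.flush()

    def flush(self):
        if self.pending:
            # One multi-row round trip per 1000 files instead of one
            # INSERT per file, and nothing is staged in a server-side
            # temporary table, so each row is written only once.
            self.con.executemany(
                "INSERT OR IGNORE INTO Path (Path) VALUES (?)", self.pending)
            self.pending.clear()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Path (Path TEXT UNIQUE)")
cache = AttrCache(con)
for i in range(2500):
    cache.add(f"/data/dir{i % 1200}/")  # lots of repeated directories
cache.flush()  # push out the final partial chunk

count = con.execute("SELECT COUNT(*) FROM Path").fetchone()[0]
print(count)  # 1200
```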

I have run both with batch mode (before I clustered my database) and
without, and frankly I saw no visible overall performance improvement
from batching.


-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
