Hello BaseX-Community,

I have a question concerning performance ... once again ... sorry ;-)

First thing I noticed is that when I run a query that involves quite a lot of 
data (more information below), the first execution of the query takes much more 
time than the second execution. In my example, the first execution on a certain 
machine (more information below) took 4.61 hours, while the second execution of 
the very same query took 23.8 minutes. I suppose that there is some caching 
mechanism in place? Is that true?
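For reference, I timed the two runs roughly like this (a sketch; query.xq stands 
for my actual query file, and I am using the standalone basex script here, which 
as far as I know passes options prefixed with -X on to the JVM):

```shell
# First run: cold caches (OS file cache, BaseX buffers)
time basex -Xmx6g query.xq

# Second run of the very same query: warm caches
time basex -Xmx6g query.xq
```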

Second thing is that the same XQuery mentioned above shows extremely different 
performance on two different machines. I am testing the XQuery on my local PC, 
where it runs quite fast (4.7 minutes on first execution; 4.5 minutes on second 
execution). On a (virtual) server (where the XQuery should run in the end) it 
takes much longer (as mentioned above: 4.61 hours on first execution; 23.8 
minutes on second execution). Here are some more details:

Local PC (fast execution):
  - OS: Ubuntu Desktop 18.04.5 LTS
  - RAM: 16GB LPDDR3, 2133 MT/s
  - Processor: Intel Core i7-8565U 1.80GHz, 8 CPUs
  - Java version: OpenJDK 11.0.9.1
  - BaseX version: 9.4.4 (from the ZIP archive), using the command line client 
with -Xmx6g

Virtual Server, VMware (slow execution):
  - OS: Ubuntu Server 20.04.1 LTS
  - RAM: 16GB (no other details available)
  - Processor: Intel Xeon Gold 6126 2.60GHz, 4 CPUs
  - Java version: OpenJDK 11.0.9.1
  - BaseX version: 9.4.4 (from the ZIP archive), using the command line client 
with -Xmx6g

The data for testing and the XQuery that is executed are exactly the same on 
both machines, as are the BaseX version (9.4.4) and the assigned memory (6 GB).

The data that I use are freely available library data. They can be downloaded 
here if that is of interest:
  1. https://data.dnb.de/GND/authorities-geografikum_dnbmarc_20201013.mrc.xml.gz
  2. 
https://data.dnb.de/GND/authorities-koerperschaft_dnbmarc_20201013.mrc.xml.gz
  3. https://data.dnb.de/GND/authorities-kongress_dnbmarc_20201013.mrc.xml.gz
  4. https://data.dnb.de/GND/authorities-person_dnbmarc_20201013.mrc.xml.gz
  5. https://data.dnb.de/GND/authorities-sachbegriff_dnbmarc_20201013.mrc.xml.gz
  6. https://data.dnb.de/GND/authorities-werk_dnbmarc_20201013.mrc.xml.gz

So I have 6 data sets from the downloaded data. For each of these data sets I 
create a separate BaseX database (with the options INTPARSE=true and 
STRIPNS=true; all other options are left at their default values). The 
databases are named "geografikum", "koerperschaft", "kongress", "person", 
"sachbegriff" and "werk". I have to mention that the data set "person" is quite 
big (~20 GB); the others are between 1 GB and 6 GB in size. The databases are 
not altered before the execution, so all indexes (ATTRINDEX, TEXTINDEX) should 
be up to date.
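For completeness, the database creation roughly looked like this (a sketch; the 
file name is the unpacked download from above, and the same pattern is repeated 
for the other five data sets):

```shell
# Set the parser options, then create one database per input file
basex -c "SET INTPARSE true; SET STRIPNS true; CREATE DB person authorities-person_dnbmarc_20201013.mrc.xml"
```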

The XQuery (see below) that I execute does this:
  1. Get ID numbers from two other databases ("db1" and "db2" in the XQuery 
below). I get 144560 ID numbers in total.
  2. Iterate over the 6 databases created from the data sets mentioned above 
and get all records that have one of the 144560 ID numbers that were retrieved 
before.
  3. Count the retrieved records from the 6 databases. The result is 134388 
records.

As mentioned: the execution (especially the first run) on the virtual server is 
very slow compared to the execution on the local PC. The final goal is not to 
count the retrieved records but to create a new database from them; I just 
encountered the performance issues while doing some testing with "count()". 
Creating the new database shows the same performance issues.
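For context, the "create a new database" variant replaces the final count() 
with something like this (a sketch; "newdb" and the generated paths are 
placeholders):

```xquery
(: Sketch: materialize the matched records into a new database.
   db:create takes the database name, the input nodes,
   and one target path per input node. :)
db:create(
  'newdb',
  $result,
  for $record at $i in $result
  return 'record-' || $i || '.xml'
)
```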

So now I basically have 3 questions:
  1. Why is the first execution of the XQuery so slow and the second one 
faster, especially on the server?
  2. Why is the execution on the server much slower than on the local PC?
  3. Is there any way to speed this up? Would it e.g. make sense to split up 
the very big data set "person" into multiple databases? Could that big data 
set be the bottleneck?

Finally, here is the XQuery that is executed:

(: Get IDs from 2 other databases :)
let $otherDBs := ('db1', 'db2')
let $idsInOtherDBs :=
  (# db:enforceindex #) {
  for $otherDB in $otherDBs
    return distinct-values(db:open($otherDB)/collection/record
    /datafield[@tag=('689', '100', '110', '111', '700', '710', '711')]
    /subfield[@code='0'])
  }

(: Get records from 6 datasets by ID from other databases :)
let $searchDBs := ('geografikum', 'koerperschaft', 'kongress', 'person',
                   'sachbegriff', 'werk')
let $result :=
  (# db:enforceindex #) {
  for $searchDB in $searchDBs
    return
      (db:text($searchDB, $idsInOtherDBs)
        /parent::element(subfield))[(@attribute(code)='a')]
        /parent::element(datafield)[(@attribute(tag)='035')]
        /parent::element(record)
      union
      (db:text($searchDB, $idsInOtherDBs)
        /parent::element(controlfield))[(@attribute(tag)='001')]
        /parent::element(record)
  }

(: Count the records :)
return count($result)

As always: Thank you very much for any suggestions!

Michael
