Hello BaseX-Community, I have a question concerning performance ... once again ... sorry ;-)
First thing I noticed is that when I run a query that involves quite a lot of data (more information below), the first execution of the query takes much more time as the second execution. In my example, the first execution on a certain machine (more information below) took 4,61 hours, the second execution of the very same query took 23,8 minutes. I suppose that there is some caching mechanism in place? Is that true? Second thing is that the same xQuery mentioned above has an extremly different performance on two different machines. I am testing the xQuery on my local PC where it runs quite fast (4,7 minutes on first execution; 4,5 minutes on second execution). On a (virtual) server (where the xQuery should run in the end) it takes much longer (as mentioned above: 4,61 hours on first execution; 23,8 minutes on second execution). Here are some more details: Local PC (fast execution): - OS: Ubuntu Desktop 18.04.5 LTS - RAM: 16GB LPDDR3, 2133 MT/s - Processor: Intel Core i7-8565U 1.80GHz, 8 CPUs - Java version: OpenJDK 11.0.9.1 - BaseX version: 9.4.4 (from the zip archive), using the command line client with Xmx6g Virtual Server, VMware (slow execution): - OS: Ubuntu Server 20.04.1 LTS - RAM: 16GB (no other details available) - Processor: Intel Xeon Gold 6126 2.60GHz, 4 CPUs - Java version: OpenJDK 11.0.9.1 - BaseX version: 9.4.4 (from the zip archive), using the command line client with Xmx6g The data for testing and the xQuery that is executed are exactly the same, also the BaseX version (9.4.4) and the assigned memory (6 GB). The data that I use are freely available library data. They can be downloaded here if that is for interest: 1. https://data.dnb.de/GND/authorities-geografikum_dnbmarc_20201013.mrc.xml.gz 2. https://data.dnb.de/GND/authorities-koerperschaft_dnbmarc_20201013.mrc.xml.gz 3. https://data.dnb.de/GND/authorities-kongress_dnbmarc_20201013.mrc.xml.gz 4. https://data.dnb.de/GND/authorities-person_dnbmarc_20201013.mrc.xml.gz 5. https://data.dnb.de/GND/authorities-sachbegriff_dnbmarc_20201013.mrc.xml.gz 6. https://data.dnb.de/GND/authorities-werk_dnbmarc_20201013.mrc.xml.gz So I have 6 data sets from the downloaded data. For each of these data sets I create a separate BaseX database (with options INTPARSE=true and STRIPNS=true ... other options are not touched, so that they have their default values). The databases are named "geografikum", "koerperschaft", "kongress", "person", "sachbegriff" and "werk". I have to mention that the data set "person" is quite big (~20 GB). The others are between 1 GB and 6 GB in size. The databases are not altered before the execution, so all indexes (ATTRINDEX, TEXTINDEX) should be up to date. The xQuery (see below) that I execute does this: 1. Get ID numbers from two other databases ("db1" and "db2" in the xQuery below). I get 144560 ID numbers in total. 2. Iterate over the 6 databases created from the data sets mentioned above and get all records that have one of the 144560 ID numbers that were retrieved before. 3. Count the retreived records from the 6 databases. The result is 134388 records. As mentioned: The execution (especially the first run) on the virtual server is very slow compared to the execution on the local PC. The final goal is not to count the retrieved records, but to create a new database from them. I just encountered the performance issues when I did some testing with "count()". But creating the new database has the same performance issues. So now I basically have 3 questions: 1. Why is the first execution of the xQuery so slow and the second one faster, especially on the server? 2. Why is the execution on the server much slower than on the local PC? 3. I there any way to speed that up? Would it e. g. make sense to split up the very big dataset "person" into multiple databases? Could that big dataset be the bottleneck? Finally, here is the xQuery that is executed: (: Get IDs from 2 other databases :) let $otherDBs := ('db1', 'db2') let $idsInOtherDBs := (# db:enforceindex #) { for $otherDB in $otherDBs return distinct-values(db:open($otherDB)/collection/record /datafield[@tag=('689', '100', '110', '111', '700', '710', '711')] /subfield[@code='0']) } (: Get records from 6 datasets by ID from other databases :) let $searchDBs := ('geografikum', 'koerperschaft', 'kongress', 'person', 'sachbegriff', 'werk') let $result := (# db:enforceindex #) { for $searchDB in $searchDBs return (db:text($searchDB, $idsInOtherDBs) /parent::element(subfield))[(@attribute(code)='a')] /parent::element(datafield)[(@attribute(tag)='035')] /parent::element(record) union (db:text($searchDB, $idsInOtherDBs) /parent::element(controlfield))[(@attribute(tag)='001')] /parent::element(record) } (: Count the records :) return count($result) As always: Thank you very much for any suggestions! Michael