Hi Michael,

> I have a question concerning performance ... once again ... sorry ;-)

No reason to be sorry ;) Instead I am tempted to say sorry that I may
not really give you helpful advice on why your queries are so much
slower in your virtual environment. I’d guess that this is not
particularly related to BaseX, but rather depends on your general
setup. If you use VMware on your local PC, 16 GB RAM for the VM may
simply be too much, and all the data will be swapped to your hard
disk. What happens if you reduce the VM RAM size to 8 GB?

You could try to reduce the Xmx6g value to a smaller one, as most of
the caching magic that BaseX benefits from will be performed by your
operating system. If you assign more fixed memory to BaseX, there may
be less memory available for the OS.

Hope this helps,
Christian



> First thing I noticed is that when I run a query that involves quite a lot of 
> data (more information below), the first execution of the query takes much 
> more time as the second execution. In my example, the first execution on a 
> certain machine (more information below) took 4,61 hours, the second 
> execution of the very same query took 23,8 minutes. I suppose that there is 
> some caching mechanism in place? Is that true?
>
> Second thing is that the same xQuery mentioned above has an extremly 
> different performance on two different machines. I am testing the xQuery on 
> my local PC where it runs quite fast (4,7 minutes on first execution; 4,5 
> minutes on second execution). On a (virtual) server (where the xQuery should 
> run in the end) it takes much longer (as mentioned above: 4,61 hours on first 
> execution; 23,8 minutes on second execution). Here are some more details:
>
> Local PC (fast execution):
>   - OS: Ubuntu Desktop 18.04.5 LTS
>   - RAM: 16GB LPDDR3, 2133 MT/s
>   - Processor: Intel Core i7-8565U 1.80GHz, 8 CPUs
>   - Java version: OpenJDK 11.0.9.1
>   - BaseX version: 9.4.4 (from the zip archive), using the command line 
> client with Xmx6g
>
> Virtual Server, VMware (slow execution):
>   - OS: Ubuntu Server 20.04.1 LTS
>   - RAM: 16GB (no other details available)
>   - Processor: Intel Xeon Gold 6126 2.60GHz, 4 CPUs
>   - Java version: OpenJDK 11.0.9.1
>   - BaseX version: 9.4.4 (from the zip archive), using the command line 
> client with Xmx6g
>
> The data for testing and the xQuery that is executed are exactly the same, 
> also the BaseX version (9.4.4) and the assigned memory (6 GB).
>
> The data that I use are freely available library data. They can be downloaded 
> here if that is for interest:
>   1. 
> https://data.dnb.de/GND/authorities-geografikum_dnbmarc_20201013.mrc.xml.gz
>   2. 
> https://data.dnb.de/GND/authorities-koerperschaft_dnbmarc_20201013.mrc.xml.gz
>   3. https://data.dnb.de/GND/authorities-kongress_dnbmarc_20201013.mrc.xml.gz
>   4. https://data.dnb.de/GND/authorities-person_dnbmarc_20201013.mrc.xml.gz
>   5. 
> https://data.dnb.de/GND/authorities-sachbegriff_dnbmarc_20201013.mrc.xml.gz
>   6. https://data.dnb.de/GND/authorities-werk_dnbmarc_20201013.mrc.xml.gz
>
> So I have 6 data sets from the downloaded data. For each of these data sets I 
> create a separate BaseX database (with options INTPARSE=true and STRIPNS=true 
> ... other options are not touched, so that they have their default values). 
> The databases are named "geografikum", "koerperschaft", "kongress", "person", 
> "sachbegriff" and "werk". I have to mention that the data set "person" is 
> quite big (~20 GB). The others are between 1 GB and 6 GB in size. The 
> databases are not altered before the execution, so all indexes (ATTRINDEX, 
> TEXTINDEX) should be up to date.
>
> The xQuery (see below) that I execute does this:
>   1. Get ID numbers from two other databases ("db1" and "db2" in the xQuery 
> below). I get 144560 ID numbers in total.
>   2. Iterate over the 6 databases created from the data sets mentioned above 
> and get all records that have one of the 144560 ID numbers that were 
> retrieved before.
>   3. Count the retreived records from the 6 databases. The result is 134388 
> records.
>
> As mentioned: The execution (especially the first run) on the virtual server 
> is very slow compared to the execution on the local PC. The final goal is not 
> to count the retrieved records, but to create a new database from them. I 
> just encountered the performance issues when I did some testing with 
> "count()". But creating the new database has the same performance issues.
>
> So now I basically have 3 questions:
>   1. Why is the first execution of the xQuery so slow and the second one 
> faster, especially on the server?
>   2. Why is the execution on the server much slower than on the local PC?
>   3. I there any way to speed that up? Would it e. g. make sense to split up 
> the very big dataset "person" into multiple databases? Could that big dataset 
> be the bottleneck?
>
> Finally, here is the xQuery that is executed:
>
> (: Get IDs from 2 other databases :)
> let $otherDBs := ('db1', 'db2')
> let $idsInOtherDBs :=
>   (# db:enforceindex #) {
>   for $otherDB in $otherDBs
>     return distinct-values(db:open($otherDB)/collection/record
>     /datafield[@tag=('689', '100', '110', '111', '700', '710', '711')]
>     /subfield[@code='0'])
>   }
>
> (: Get records from 6 datasets by ID from other databases :)
> let $searchDBs := ('geografikum', 'koerperschaft', 'kongress', 'person', 
> 'sachbegriff', 'werk')
> let $result :=
>   (# db:enforceindex #) {
>   for $searchDB in $searchDBs
>     return
>       (db:text($searchDB, $idsInOtherDBs)
>         /parent::element(subfield))[(@attribute(code)='a')]
>         /parent::element(datafield)[(@attribute(tag)='035')]
>         /parent::element(record)
>       union
>       (db:text($searchDB, $idsInOtherDBs)
>         /parent::element(controlfield))[(@attribute(tag)='001')]
>         /parent::element(record)
>   }
>
> (: Count the records :)
> return count($result)
>
> As always: Thank you very much for any suggestions!
>
> Michael

Reply via email to