Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when executing specific xQuery

BIRKNER Michael Fri, 08 May 2020 04:32:26 -0700

Hi Christian,


thank you for your answers. As you may have guessed, the queries I sent in my
original email are just simplified examples.


The real XML structure looks like the following (it is library data in the
"MarcXML" format; you can see an example here:
https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml):


db1: each of the 7489 documents has this structure


<collection>

 <record>

   <controlfield tag="001">ID-Number</controlfield>

   ... [more tags named "controlfield" or "datafield"]

 </record>

 ... [more records]

</collection>


So in db1 I have 7489 documents each with a 
"<collection><record>...</record></collection>" structure, so I have 7489 
"collection" nodes.


db2: It's the same structure as above, but there is only 1 "collection" and all 
"records" are within that "collection".


Some background information:

In db1 I store updated versions of records that also (partly) exist in db2. I
download them from an OAI-PMH interface, which gives me only 50 records at a
time, so I have to page through the results and end up with 7489 XML files
that I import into db1. As a result, there are multiple records with the same
ID. Normally there are only 2 (the original and the updated one), but there can
be 3 or more, because the downloaded updates can themselves contain multiple
records with the same ID (an updated one, an update of the updated one, and so
on ... I know ... complicated).

One of the records with the same ID is the newest one. My goal is to find the 
newest one and delete the others (based on a timestamp that is also found in 
another <controlfield> in the record). So all of this is about updating records 
in an existing database from downloaded update-files that I get via OAI.
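
The goal described above could be sketched roughly as follows. This is only a
minimal sketch under assumptions, not a tested solution: it assumes the ID is
in controlfield 001 (as in the sample structure above), that the timestamp is
in controlfield 005 (an assumed tag; the actual one would have to be
substituted), and that the timestamp has a fixed-width format so that string
order equals chronological order. The *: wildcards are used because MarcXML
files may or may not declare a namespace.

```xquery
(: Sketch only. Assumptions: ID in controlfield 001, timestamp in
   controlfield 005 (hypothetical tag), fixed-width timestamp strings. :)
let $records := (db:open('db1'), db:open('db2'))//*:record
for $rec in $records
let $id := string($rec/*:controlfield[@tag = '001'])
group by $id
(: After grouping, $rec holds all records that share the same ID. :)
where count($rec) > 1
let $sorted :=
  for $r in $rec
  order by string($r/*:controlfield[@tag = '005']) descending
  return $r
(: Keep the newest record (the first one) and delete all others. :)
return delete node tail($sorted)
```

With databases of this size, grouping all records in one pass may need a lot
of memory, so it would be worth trying this on a small copy of the data first.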


I hope this information helps. And thank you for pointing out the new version 
9.3.3. I will try that one.


Best regards,

Michael



Mag. Michael Birkner
AK Wien - Bibliothek
1040, Prinz Eugen Straße 20-22
T: +43 1 501 65 12455
F: +43 1 501 65 142455
M: +43 664 88957669

michael.birk...@akwien.at<mailto:michael.birk...@akwien.at>
wien.arbeiterkammer.at<http://wien.arbeiterkammer.at/>



________________________________
From: Christian Grün <christian.gr...@gmail.com>
Sent: Friday, 8 May 2020 12:37
To: BIRKNER Michael
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when 
executing specific xQuery

I tried to reproduce your use case by creating some sample data (with a few 
millions of entries), but both the query plan and the performance were similar 
in 9.2.4 and the current 9.3.3 beta version.

And I am still trying to understand your example query. Is it correct that the 
attributes of your exampletag elements have static ids, and that the text value 
of the exampletag element contains an id as well? If you can provide me with 
some example documents of your database, that might help us to track down the 
problem.

And feel free to check out the latest stable snapshot [1]. BaseX 9.3.3 is 
close, and lots of new optimizations and rewritings have been added since 
9.3.2, so maybe the problem you encountered is already fixed.

[1] http://files.basex.org/releases/latest/




On Fri, May 8, 2020 at 10:19 AM BIRKNER Michael 
<michael.birk...@akwien.at<mailto:michael.birk...@akwien.at>> wrote:

Hi,

I am observing a performance loss between BaseX versions 9.2.4 (which I had 
been using so far) and 9.3.2 (to which I updated recently) when executing an 
XQuery like this:

---
(: Open 2 databases and get all <record>s :)
let $recsFromDb1 := db:open('db1')/record
let $recsFromDb2 := db:open('db2')/record

(: Get distinct IDs of all records in db1 :)
let $idsFromRecsInDb1 :=
  distinct-values($recsFromDb1/exampletag[@exampleattr='id'])

(: Iterate over the distinct IDs of db1 and return the records from db2
   with the same ID :)
for $id in $idsFromRecsInDb1
let $recFromDb2WithSameId := $recsFromDb2[exampletag[@exampleattr='id']=$id]
return $recFromDb2WithSameId
---

In BaseX version 9.2.4 the query executes very fast (2 - 3 seconds). In 9.3.2 I 
did not wait for it to finish: I aborted after several minutes because I 
suspected that something must be wrong.
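
One way to check whether the slowdown comes from the optimizer no longer
rewriting the predicate to an index lookup might be to call the text index
explicitly. The following is a minimal sketch, assuming the simplified element
names from the query above, an up-to-date text index on db2, and that each
exampletag element contains a single text node (so that its string value equals
that text node):

```xquery
(: Sketch only: explicit text index lookup instead of a predicate
   that the optimizer has to rewrite. :)
let $ids := distinct-values(
  db:open('db1')/record/exampletag[@exampleattr = 'id']
)
for $id in $ids
(: db:text returns the text nodes in db2 whose value equals $id;
   the parent steps climb back up to the enclosing record. :)
return db:text('db2', $id)
  /parent::exampletag[@exampleattr = 'id']
  /parent::record
```

If this variant runs at similar speed in both versions, that would point
towards a difference in query rewriting rather than in the index itself.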

Both BaseX instances have allocated the same amount of memory (4096MB). The 
databases (db1 and db2) were created in the respective BaseX version from 
scratch and contain attribute and text indexes. They were optimized before 
executing the query mentioned above. All options and preferences are the same 
in both BaseX instances. I am using the GUI in Ubuntu 18.04.

Here are some more details about the two databases:

db1:
- Size: 2255MB
- Nodes: 97598775
- Documents: 7489
- Uptodate: true

db2:
- Size: 883MB
- Nodes: 46317512
- Documents: 1
- Uptodate: true

Does someone have an idea why there is such a difference in performance between 
the two BaseX versions?

Thanks for any answers and hints!

Best regards,
Michael



