Ahh, good old MAB2 :-) ... but it was as complex as Marc21.
Thank you again for your tests and information. I will run the test again with my setup and let you know whether I see any differences between your query and mine, and what could be causing the performance loss. I probably won't be able to do this before Monday next week, but I will get back to you as soon as possible.

Best regards and have a nice weekend!
Michael

Mag. Michael Birkner
AK Wien - Bibliothek
1040, Prinz Eugen Straße 20-22
T: +43 1 501 65 12455
F: +43 1 501 65 142455
M: +43 664 88957669
michael.birk...@akwien.at
wien.arbeiterkammer.at <http://wien.arbeiterkammer.at/>
Visit us also on: facebook <http://www.facebook.com/arbeiterkammer/> | twitter <https://twitter.com/Arbeiterkammer> | youtube <https://www.youtube.com/user/AKoesterreich>
--------------------------------------------------
The AK has been standing up for justice for 100 years. Then. Now. Forever.
arbeiterkammer.at/100 <https://arbeiterkammer.at/100> <https://w.ak.at/zukunftsprogramm>

________________________________
From: Christian Grün <christian.gr...@gmail.com>
Sent: Friday, 8 May 2020 14:24
To: BIRKNER Michael
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when executing specific xQuery

And I’m always delighted to be confronted with library use cases. BaseX grew up with library data; at that time, mostly XML variants of MAB2.

I made another attempt to reproduce your setting by creating two databases with MARCXML data (rather small, 10,000 and 10 documents each).
This is the query I tried:

let $recsFromDb1 := db:open('db1')//*:record
let $recsFromDb2 := db:open('db2')//*:record
let $idsFromRecsInDb1 := distinct-values(
  $recsFromDb1/*:controlfield[@tag = '001']
)
for $id in $idsFromRecsInDb1
let $recFromDb2WithSameId := $recsFromDb2
  [*:controlfield[@tag = '001'] = $id]
return $recFromDb2WithSameId

Both query plans and execution times are pretty much the same. Can you tell me what I need to change in my query to simulate the slowdown?

As a preview, I already have an idea how you can boost the query evaluation (provided your databases have up-to-date index structures)…

On Fri, May 8, 2020 at 1:31 PM BIRKNER Michael <michael.birk...@akwien.at> wrote:

Hi Christian,

thank you for your answers. As you can guess, the queries I sent in my original email are just simplified examples. The real XML structure looks like the following (it's library data in the "MarcXML" format; see an example here: https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml).

db1: each of the 7489 documents has this structure:

<collection>
  <record>
    <controlfield tag="001">ID-Number</controlfield>
    ... [more tags named "controlfield" or "datafield"]
  </record>
  ... [more records]
</collection>

So in db1 I have 7489 documents, each with a "<collection><record>...</record></collection>" structure, which means I have 7489 "collection" nodes.

db2: It's the same structure as above, but there is only 1 "collection" and all "records" are within that "collection".

Some background information: In db1 I save updated versions of records (downloaded from an OAI-PMH interface, which gives me only 50 records at a time, so I have to page through the results and end up with 7489 XML files that I import into db1) that also (partly) exist in db2.
So there are multiple records with the same ID. Normally there are only 2 (the original and the updated one), but there can be 3 or more, because the downloaded updates could themselves contain multiple records with the same ID (an updated one, an update of the updated one, and so on ... I know ... complicated). One of the records with the same ID is the newest one. My goal is to find the newest one and delete the others (based on a timestamp that is also found in another <controlfield> in the record).

So all of this is about updating records in an existing database from downloaded update files that I get via OAI.

I hope this information helps. And thank you for pointing out the new version 9.3.3. I will try that one.

Best regards,
Michael

________________________________
From: Christian Grün <christian.gr...@gmail.com>
Sent: Friday, 8 May 2020 12:37
To: BIRKNER Michael
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when executing specific xQuery

I tried to reproduce your use case by creating some sample data (with a few million entries), but both the query plan and the performance were similar in 9.2.4 and the current 9.3.3 beta version.

And I am still trying to understand your example query. Is it correct that the attributes of your exampletag elements have static IDs, and that the text value of each exampletag element contains an ID as well?

If you can provide me with some example documents from your database, that might help us to track down the problem.

And feel free to check out the latest stable snapshot [1]. BaseX 9.3.3 is close, and lots of new optimizations and rewritings have been added since 9.3.2, so maybe the problem you encountered is already fixed.

[1] http://files.basex.org/releases/latest/

On Fri, May 8, 2020 at 10:19 AM BIRKNER Michael <michael.birk...@akwien.at> wrote:

Hi,

I am observing a performance loss between BaseX versions 9.2.4 (which I was using so far) and 9.3.2 (to which I updated recently) when executing an xQuery like this:

---
(: Open 2 databases and get all <record>s :)
let $recsFromDb1 := db:open('db1')/record
let $recsFromDb2 := db:open('db2')/record

(: Get distinct IDs of all records in db1 :)
let $idsFromRecsInDb1 := distinct-values($recsFromDb1/exampletag[@exampleattr='id'])

(: Iterate over the distinct IDs of db1 and return the records from db2 with the same ID :)
for $id in $idsFromRecsInDb1
let $recFromDb2WithSameId := $recsFromDb2[exampletag[@exampleattr='id']=$id]
return $recFromDb2WithSameId
---

In BaseX version 9.2.4 the query executes very fast (2 - 3 seconds). In 9.3.2 I didn't wait for it to finish; I aborted after several minutes because I suspected that something must be wrong.
Both BaseX instances have been allocated the same amount of memory (4096MB). The databases (db1 and db2) were created from scratch in the respective BaseX version and contain attribute and text indexes. They were optimized before executing the query mentioned above. All options and preferences are the same in both BaseX instances. I am using the GUI on Ubuntu 18.04.

Here are some more details about the two databases:

db1:
- Size: 2255MB
- Nodes: 97598775
- Documents: 7489
- Uptodate: true

db2:
- Size: 883MB
- Nodes: 46317512
- Documents: 1
- Uptodate: true

Does anyone have an idea why there is such a difference in performance between the two BaseX versions?

Thanks for any answers and hints!

Best regards,
Michael
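
For the goal described earlier in the thread (keep only the newest record per ID), one possible approach is to group all records by ID in a single pass rather than filtering db2 once per ID. The following is only a sketch under assumptions that the thread does not confirm: the timestamp is taken from controlfield 005, its values are fixed-width so that string order matches chronological order, and the databases are named db1 and db2 as above. It is not the optimization Christian hints at, and it has not been measured against either BaseX version.

```xquery
(: Sketch only. Assumptions: MARCXML records in 'db1' and 'db2', the ID in
   controlfield 001, and a fixed-width timestamp in controlfield 005
   (the thread does not say which controlfield holds the timestamp). :)
let $records := (db:open('db1'), db:open('db2'))//*:record
for $rec in $records
(: Group all records by ID in one pass; after grouping, $rec holds every
   record that shares the current $id. :)
group by $id := string($rec/*:controlfield[@tag = '001'][1])
where count($rec) > 1
let $newestFirst :=
  for $r in $rec
  order by string($r/*:controlfield[@tag = '005'][1]) descending
  return $r
(: Everything after the newest record is considered outdated. :)
return tail($newestFirst)
```

As written, the query only lists the outdated records, so the result can be inspected first. Replacing the last line with `return delete node tail($newestFirst)` would turn it into an updating query that removes them. If two records with the same ID carry an identical timestamp, which one is kept is arbitrary in this sketch.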