Constantine, I guess it's because commands do not have to maintain a pending update list, and can insert data directly into the collection.
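For illustration, the command-based approach could look like the following BaseX command script (database name, option settings, and archive paths are hypothetical):

```
CREATE DB my_db
SET ADDARCHIVES true
SET ADDRAW false
ADD /data/archives/batch1.zip
ADD /data/archives/batch2.zip
```

Each ADD is executed and committed on its own, rather than accumulating one large pending update.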
A mixed approach could be to use your *.zip iteration to build a command file that runs the XQuery pieces one at a time, each adding a single archive, in order to keep the Pending Update List short. And you are speaking of more than 500 GB of data!

Best regards,
Fabrice

From: Hondros, Constantine (ELS-AMS) [mailto:c.hond...@elsevier.com]
Sent: Monday, 4 May 2015 14:36
To: Fabrice Etanchaud; basex-talk@mailman.uni-konstanz.de
Subject: RE: Pulling files from multiple zips into one DB

Hi Fabrice,

Indeed, my archives contain massive amounts of PDF. However, I did a quick benchmark, and the GUI, using standard options (parse archives, don't add raw files), is over 10 (!) times faster at creating a DB than my code sample below. Not sure why that would be the case.

C.

From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: 04 May 2015 14:19
To: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de
Subject: RE: Pulling files from multiple zips into one DB

If your archives contain a mix of raw and XML files, have a look at the old ZIP module; it may avoid reading the entire archive.

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Fabrice Etanchaud
Sent: Monday, 4 May 2015 14:12
To: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Pulling files from multiple zips into one DB

Dear Constantine,

In my experience, commands are always faster than db:* calls. Maybe someone @basex could confirm that, and confirm that commands do not use the Pending Update List?

Are you sure you disabled ADDRAW? If there are many raw files alongside the XML files, you may get better results by extracting and re-archiving only the XML first. I have the same problem with patent archives, where each XML file may come with many PDFs and GIFs.
Best regards,
Fabrice

From: Hondros, Constantine (ELS-AMS) [mailto:c.hond...@elsevier.com]
Sent: Monday, 4 May 2015 14:01
To: Fabrice Etanchaud
Subject: RE: Pulling files from multiple zips into one DB

Is that going to be any faster, do you think? I tried it, and it took a long time to read through the zips, so I am hoping there might be a faster, more direct way of doing it.

From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: 04 May 2015 13:56
To: Hondros, Constantine (ELS-AMS)
Subject: RE: Pulling files from multiple zips into one DB

Hello Constantine,

Why don't you simply create a new collection with ADDARCHIVES=true?

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Hondros, Constantine (ELS-AMS)
Sent: Monday, 4 May 2015 13:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Pulling files from multiple zips into one DB

Hello all,

I need to merge all the XML files located in 500 GB of zips into a single DB for further analysis. Is there any faster or more efficient way to do it in BaseX than this?

  for $zip in file:list($src, false(), '*.zip')
  let $arch := file:read-binary(concat($src, '\', $zip))
  for $a in archive:entries($arch)[ends-with(., 'xml')]
  return db:add('my_db', archive:extract-text($arch, $a), $a)

TIA,
Constantine

________________________________
Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
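The "command file" idea suggested upthread could be sketched in XQuery as follows. This is only an illustrative sketch: the source directory, database name, and output file name are hypothetical, and it assumes BaseX's File Module is available. It turns the same *.zip listing into a command script with one ADD per archive, so each archive is inserted by its own command and the Pending Update List stays short:

```xquery
(: Hypothetical sketch: write a BaseX command script with one ADD per archive. :)
let $src := '/data/archives/'            (: assumed source directory :)
let $commands := (
  'CREATE DB my_db',                     (: assumed database name :)
  'SET ADDARCHIVES true',
  'SET ADDRAW false',
  for $zip in file:list($src, false(), '*.zip')
  return 'ADD ' || $src || $zip
)
return file:write-text-lines($src || 'load.bxs', $commands)
```

The resulting script could then be run from the command line (e.g. `basex /data/archives/load.bxs`), executing and committing one archive at a time.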