Constantine,

I guess it's because commands do not have to maintain a pending update list
and can insert data directly into the collection.

A mixed approach could be to use your *.zip iteration to build a command file
that runs, one archive at a time, the piece of XQuery that adds a single
archive, in order to shorten the PUL (pending update list).
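
A minimal sketch of how such a command file could be generated. It swaps the XQuery iteration for a small Python script, and it assumes that each XQUERY line of a BaseX command script is executed as its own command, which is what should keep each pending update list limited to one archive. The function names, and the database and file names in the usage note, are hypothetical; the query text is the one from the original post, and `my_db` is assumed to exist already.

```python
def xquery_string(value: str) -> str:
    """Quote a Python string as an XQuery string literal."""
    # Not handled in this sketch: '&' starts an entity reference in XQuery,
    # and ';' or line breaks may be treated as command separators.
    if any(ch in value for ch in "&;\r\n"):
        raise ValueError(f"unsupported character in {value!r}")
    # A single quote inside an XQuery string literal is escaped by doubling it.
    return "'" + value.replace("'", "''") + "'"


def build_command_script(zip_paths, db_name: str) -> str:
    """Return a command script with one XQUERY command per archive."""
    lines = []
    for zip_path in zip_paths:
        query = (
            f"let $arch := file:read-binary({xquery_string(str(zip_path))}) "
            "for $a in archive:entries($arch)[ends-with(., 'xml')] "
            f"return db:add({xquery_string(db_name)}, "
            "archive:extract-text($arch, $a), $a)"
        )
        # Each query is kept on a single line, one command per line.
        lines.append("XQUERY " + query)
    return "\n".join(lines) + "\n"
```

Writing the result of `build_command_script(sorted(pathlib.Path(src).glob("*.zip")), "my_db")` to a file such as `add_zips.bxs` would give a script that BaseX can run as a sequence of commands. Whether this is actually faster than a single query would have to be measured.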

And you are talking about more than 500 GB of data!

Best regards,
Fabrice

From: Hondros, Constantine (ELS-AMS) [mailto:c.hond...@elsevier.com]
Sent: Monday, 4 May 2015 14:36
To: Fabrice Etanchaud; basex-talk@mailman.uni-konstanz.de
Subject: RE: Pulling files from multiple zips into one DB

Hi Fabrice,

Indeed, my archives contain massive numbers of PDFs. However, I did a quick
benchmark, and the GUI, using standard options (parse archives, don't add raw
files), is over 10 (!) times faster at creating a DB than my code sample below.
Not sure why that would be the case.

C.

From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: 04 May 2015 14:19
To: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de
Subject: RE: Pulling files from multiple zips into one DB

If your archives contain a mix of raw and XML files,
have a look at the old zip module, which may avoid reading the entire archive.

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de
 [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Fabrice
Etanchaud
Sent: Monday, 4 May 2015 14:12
To: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Pulling files from multiple zips into one DB

Dear Constantine,

In my experience, commands are always faster than db:* calls.
Maybe someone @basex could confirm that, and also that commands do not use the
Pending Update List?

Are you sure you disabled ADDRAW?
If there are many raw files alongside the XML files, you may get better results
by extracting the XML files and re-archiving only those beforehand.
I have the same problem with patent archives, where each XML file may come with
many PDF and GIF files.
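
A minimal sketch of that re-archiving step, done outside BaseX with Python's standard `zipfile` module. The function name is hypothetical, and entries are selected purely by a `.xml` file name suffix, which is an assumption about how the archives are laid out.

```python
import zipfile


def repack_xml_only(src, dst) -> int:
    """Copy only the *.xml entries of the zip `src` into a new zip `dst`.

    Both arguments may be file paths or binary file objects.
    Returns the number of entries copied.
    """
    copied = 0
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", compression=zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            # Skip directories and anything that is not named *.xml
            # (PDF, GIF and other raw files).
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # Only one entry at a time is read into memory.
            zout.writestr(info.filename, zin.read(info))
            copied += 1
    return copied
```

The slimmed-down archives could then be fed to BaseX in place of the originals, so that the import no longer has to skip over the raw files.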

Best regards,
Fabrice

From: Hondros, Constantine (ELS-AMS) [mailto:c.hond...@elsevier.com]
Sent: Monday, 4 May 2015 14:01
To: Fabrice Etanchaud
Subject: RE: Pulling files from multiple zips into one DB

Is that going to be any faster, do you think? I tried it and it took a looooong
time to read through the zips, so I am hoping there might be a faster, more
direct way of doing it.

From: Fabrice Etanchaud [mailto:fetanch...@questel.com]
Sent: 04 May 2015 13:56
To: Hondros, Constantine (ELS-AMS)
Subject: RE: Pulling files from multiple zips into one DB

Hello Constantine,

Why don't you simply create a new collection with ADDARCHIVES=true?

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de
 [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Hondros,
Constantine (ELS-AMS)
Sent: Monday, 4 May 2015 13:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Pulling files from multiple zips into one DB

Hello all,
I need to merge any XML files located in 500 GB of zips into a single DB for 
further analysis. Is there any faster or more efficient way to do it in BaseX 
than this? TIA.

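(: $src is the directory containing the zip files; it is bound elsewhere :)
for $zip in file:list($src, false(), '*.zip')
  let $arch := file:read-binary(concat($src, '\', $zip))
  (: add every entry whose name ends in 'xml' to my_db, under the entry name :)
  for $a in archive:entries($arch)[ends-with(., 'xml')]
  return db:add('my_db', archive:extract-text($arch, $a), $a)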


TIA,
Constantine


________________________________

Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The 
Netherlands, Registration No. 33156677, Registered in The Netherlands.

