Hi Fabrice, thank you for your fast response.

I'm sorry, the file size information I gave was incorrect. The correct figures 
for the test are:

    number of files:  17828 files
    total size:       955 MB
    average size:     ~ 55 KB/file
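
As a rough check of the per-file figure, here is a minimal Python sketch. It assumes 1 MB = 1024 * 1024 bytes and 1 KB = 1024 bytes; the variable names are only illustrative.

```python
# Rough check of the per-file size from the totals reported above.
# Assumption: 1 MB = 1024 * 1024 bytes, 1 KB = 1024 bytes.
total_bytes = 955 * 1024 * 1024
file_count = 17828

avg_kb = total_bytes / file_count / 1024
print(f"average size: ~{avg_kb:.1f} KB/file")
```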

The exact steps I did were:

1. Open BaseX GUI.
2. Create an empty database: Database > New, then I clear the "Input file or 
directory" field (so it is empty), then I enter "testdb" in "Name of 
database", and then I click on "Ok".
3. Then I open the created database: Database > Open and manage > double click 
on "testdb".
4. Then I add 3 XML files: Database > Properties, then I write into "Input file 
or directory" a directory path that contains 3 XML files, then in "Target Path" 
I write the name of the directory (that will be a prefix path for the document 
nodes), and then I click on "Add...". The files are added correctly, and I can 
run queries on them.
5. Then I add the 17828 XML files: Database > Properties, then I write into 
"Input file or directory" a directory path that contains the 17828 XML files, 
then in "Target Path" I write the name of the directory (that will be a prefix 
path for the document nodes), and then I click on "Add...". The files are not 
added correctly: I get an out of memory message:

    Out of Main Memory.
    You can try to:
    - increase Java's heap size with the flag -Xmx<size>
    - deactivate the text and attribute indexes.

Now I have tried to follow your suggestions:

* The ADDCACHE option seems to be available only in BaseX 7.7, as described in 
http://docs.basex.org/wiki/Options#ADDCACHE . I'm using 7.6, which is the 
version downloaded from the BaseX home page. In fact, I don't see any 
ADDCACHE option in the GUI (perhaps it is available only on the command line). 
I would like to run the tests only with a fully stable version. Would you 
recommend that I use 7.7?

* I have tried to optimize: after step 4, I clicked on the "Optimize" 
button. But I still get an "Out of Main Memory" error when running step 5.

* I don't see any way in the GUI to enable or disable the AUTOFLUSH option. 
Perhaps it is only available on the command line?

* I'm going to try the command line ("basex" command). I will post the results.

* I still haven't learned XQuery Update. I think I should focus on the problem 
of adding our documents to the database, for now, and then (when the problem is 
solved) I will try to optimize things (such as the time taken for opening the 
collection). 
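
The suggestions discussed above (disable AUTOFLUSH, add the files in batches, then OPTIMIZE and FLUSH after each batch) could be scripted roughly as follows. This is only a sketch: the Python helper names (batched, build_script), the batch size, and the file names are hypothetical, and it assumes paths without spaces. The generated lines use the BaseX commands SET, OPEN, ADD TO, OPTIMIZE, FLUSH and CLOSE; ADDCACHE is emitted only on request, since the documentation page mentioned above lists it for 7.7. The exact behaviour should be checked against the documentation of the installed version.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_script(db, target, files, batch_size=1000, addcache=False):
    """Build a BaseX command script that adds `files` in batches.

    Hypothetical helper: it only produces text and never talks to BaseX.
    """
    lines = []
    if addcache:
        # Assumption: only meaningful on a version that has ADDCACHE.
        lines.append("SET ADDCACHE true")
    lines.append("SET AUTOFLUSH false")
    lines.append(f"OPEN {db}")
    for batch in batched(files, batch_size):
        for path in batch:
            lines.append(f"ADD TO {target} {path}")
        # Optimize and flush explicitly at the end of each batch.
        lines.append("OPTIMIZE")
        lines.append("FLUSH")
    lines.append("CLOSE")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative file names only.
    print(build_script("testdb", "docs", ["a.xml", "b.xml", "c.xml"], batch_size=2))
```

The resulting text could be saved to a file and run with the basex command-line client; how a command script is passed to it may depend on the version, so its help output is the place to check.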

Thank you for your suggestions. I'm going to try the "basex" command now ...


freesoft



________________________________
From: Fabrice Etanchaud <[email protected]>
To: freesoft <[email protected]>; "[email protected]" 
<[email protected]> 
Sent: Monday, 15 April 2013 10:59
Subject: RE: [basex-talk] Adding millions of XML files
 


 
Hi « kgfhjjgrn » ;-)
 
The size of your test should not cause any problem to BaseX (18 000 files from 
1 up to 5 KB)
 
1. Did you try to set the ADDCACHE option?
2. You should OPTIMIZE your collection after each batch of ADD commands, 
even if no index is set.
3. Did you try to unset the AUTOFLUSH option, and explicitly FLUSH the 
updates at batch’s end?
4. The GUI may not be the best place to run updates, did you try the 
basex command line tools?
 
In my experience, opening a collection containing a huge number of documents 
may take a long time.
It seems to be related to the kind of in-memory data structure used to store 
the document names.
A workaround could be to insert your documents under a common root XML element 
with XQuery Update.
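
To illustrate the shape this workaround produces, here is a minimal Python sketch that wraps several small documents under one root element. It uses Python's xml.etree.ElementTree as a stand-in and is not XQuery Update; the function name and the root tag "documents" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def merge_under_root(xml_strings, root_tag="documents"):
    """Parse each XML string and append it under one common root.

    Hypothetical helper: shows the resulting structure only,
    not how XQuery Update would perform the insert in BaseX.
    """
    root = ET.Element(root_tag)
    for text in xml_strings:
        root.append(ET.fromstring(text))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(merge_under_root(["<invoice id='1'/>", "<invoice id='2'/>"]))
```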
 
The excellent BaseX Team may give you better advice.
 
Best,
Fabrice Etanchaud
Questel-Orbit
 
 
From: [email protected] 
[mailto:[email protected]] On behalf of freesoft
Sent: Monday, 15 April 2013 10:19
To: [email protected]
Subject: [basex-talk] Adding millions of XML files
 
Hi, I'm new to BaseX and to XQuery. I already knew XPath. I'm evaluating BaseX 
to store our XML files and run queries on them. We have to store about 1 
million XML files per month. The XML files are small (~1 KB to 5 KB). So 
our case is: a high number of files, each of small size.

I've read that BaseX is scalable and has high performance, so it is probably a 
good tool for us. But, in the tests I'm doing, I'm getting an "Out of Main 
Memory" error when loading a high number of XML files.

For example, if I create a new database ("testdb"), and add 3 XML files, no 
problem occurs. Files are stored correctly, and I can run queries on them. 
Then, if I try to add 18000 XML files to the same database ("testdb") (by using 
GUI > Database > Properties > Add Resources), then I see the coloured memory 
bar grow and grow... until an error appears:

    Out of Main Memory.
    You can try to:
    - increase Java's heap size with the flag -Xmx<size>
    - deactivate the text and attribute indexes.

The text and attribute indexes are disabled, so they are not the cause. And I 
increased the Java heap size with the flag -Xmx<size> (by editing the 
basexgui.bat script), and the same error happens.

Probably BaseX loads all files into main memory first, and then dumps them to 
the database files. It shouldn't be done that way. Each XML file should be 
loaded into main memory, then processed, and then dumped to the db files, 
independently from the rest.
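
The per-file approach described above can be sketched as follows. This illustrates the idea only and says nothing about how BaseX actually works internally; the function name and the append-only store file are hypothetical.

```python
def add_one_by_one(paths, store_path):
    """Append each file to a store, holding only one file in memory at a time.

    Hypothetical illustration of per-file processing; a real database
    would parse and index each document instead of copying raw bytes.
    """
    count = 0
    with open(store_path, "ab") as store:
        for path in paths:
            with open(path, "rb") as source:
                data = source.read()  # only the current file is in memory
            store.write(data)         # written out before the next file is read
            count += 1
    return count
```

With this structure, peak memory use depends on the largest single file rather than on the total size of the batch.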

So I have two questions:
1. Do I have to use a special way to add a high number of XML files?
2. Is BaseX sufficiently stable to store and manage our data (about 1 million 
files will be added per month)?

Thank you for your help and for your great software, and excuse me if I am 
asking recurring questions.
_______________________________________________
BaseX-Talk mailing list
[email protected]
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
