I am importing a large set (~3 million) of XML documents into BaseX and am 
running into some problems as the databases grow beyond a few gigabytes. I am 
using BaseX 8.4.4, and have increased the memory available to BaseX to 12288m 
(12 GB) using -Xmx (the machine has 20 GB total). The documents are being 
stored in several databases that range in size from 300 MB up to 25 GB.
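
For reference, the memory setting is simply passed to the JVM, so the HTTP 
server start amounts to something like the following (the exact classpath and 
startup script are simplified here):

    java -Xmx12288m -cp "BaseX.jar:lib/*" org.basex.BaseXHTTP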

My question is: what is an optimal maximum database size (in gigabytes)? I am 
hoping for some general advice, and I appreciate that the answer can vary 
depending on various factors.

The problems that I've encountered with the larger databases are:

1. Running OPTIMIZE or OPTIMIZE ALL on the larger databases results in an 
out-of-memory error. I have switched to running CREATE INDEX to build the 
text, attribute, token, and fulltext indexes separately, and found that 
creating the indexes one at a time produces fewer out-of-memory errors (a 
sketch of the command script I use follows this list). I would still like to 
be able to use OPTIMIZE ALL, because over time some documents will be removed 
from the databases, and the documentation indicates that OPTIMIZE ALL will 
remove stale information from the indexes.

2. The command scripts that run CREATE INDEX or OPTIMIZE (ALL) seem to tie up 
the machine for a long time, maybe due to heavy disk access.

3. As a database grows in size, the rate at which documents are added slows 
down. I have been measuring the number of documents imported per minute: I 
have observed rates of over 100 documents per minute, with typical rates 
around 30 - 60 documents per minute. Once a database grows beyond a few 
gigabytes, the rate slows to around 20 documents per minute. This is not much 
of a problem, because when I see the rate slow down I can start a new 
database. Unfortunately, I have been recording the number of documents rather 
than the database size.
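
To illustrate the first problem, the command script I run per database to 
build the indexes one at a time looks roughly like this (the database name is 
a placeholder):

    # open the database and build each index in a separate step
    OPEN content-db-2016
    CREATE INDEX TEXT
    CREATE INDEX ATTRIBUTE
    CREATE INDEX TOKEN
    CREATE INDEX FULLTEXT
    CLOSE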


In case this information is useful, my project is structured as follows:


* There is one central index database which records, for each document, the 
BaseX database name and path where the document is stored, along with some 
metadata that we use to identify or locate documents.

* There are multiple content databases to store the actual documents. These 
content databases are organized by DTD and time period.

* Each insert is done using the BaseX REST API. A BaseX HTTP server instance 
is running to receive documents, and a basex instance is running from the 
command line to locate and provide documents. Each insert is done by a POST 
that includes the data and an updating query which adds the document (using 
db:replace) to both the central index database and a content database in one 
"transaction" (sketched just below). This helps make the import resilient to 
duplicate documents and to any problem that might prevent a single document 
from being added, and it allows the process to continue if interrupted.
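
To make that last point concrete, the updating query sent with each POST is 
shaped roughly like the sketch below; the database names, the path convention, 
and the <entry> element are simplified placeholders, and the way the document 
itself is bound to the query is glossed over:

    declare variable $doc        external;  (: the document sent in the POST :)
    declare variable $content-db external;  (: target content database :)
    declare variable $path       external;  (: path of the document within that database :)

    (: both replaces run in the same transaction: either both succeed or neither does :)
    db:replace($content-db, $path, $doc),
    db:replace('central-index', $path,
      <entry>
        <database>{ $content-db }</database>
        <path>{ $path }</path>
        { () (: plus the metadata fields we use to identify or locate the document :) }
      </entry>
    )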

I will probably need to re-organize the content databases so that each database 
is only a few gigabytes in size. Does anyone have advice on what would be a 
good maximum size for each database?

Thanks,
Vincent


Vincent M. Lizzi - Electronic Production Manager
Taylor & Francis Group
530 Walnut St., Suite 850, Philadelphia, PA 19106
E-Mail: vincent.li...@taylorandfrancis.com
Phone: 215-606-4221
Fax: 215-207-0047
Web: http://www.tandfonline.com/

Taylor & Francis is a trading name of Informa UK Limited,
registered in England under no. 1072954

"Everything should be made as simple as possible, but not simpler."
