Hi John,
[EMAIL PROTECTED] wrote:
I have tried jcr/jackrabbit and like it.
Next I would like to push jackrabbit to its limit:
load in as many items as possible. I would appreciate help on
a few configuration/tuning issues:
(1) which persistent manager to use?
in a recent test I imported over a million wikipedia articles which
resulted in about 6 million items. no versioning, btw.
my configuration is:
dell latitude d505
db-based persistence using derby
256m heap
at the beginning the time to add an article was about 5ms.
towards the end of the load the time to add an article was stable at
about 50ms.
some other figures:
db size: 2 GB
index size: 300 MB
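for reference, a derby persistence manager is configured per workspace in repository.xml. a minimal sketch (class name and parameters are from the jackrabbit 1.x line and may differ in your version, so check the inline dtd):

```xml
<!-- goes inside the Workspace element of repository.xml -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.db.DerbyPersistenceManager">
  <!-- embedded derby database stored under the workspace home -->
  <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  <!-- prefix for the tables jackrabbit creates -->
  <param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>
```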
(2) what parameters to tune?
I can give you some advice on configuring the index: the default config
will cause lucene to create segments of 100 nodes, which will be merged
as soon as 10 segments exist. when doing a bulk load you should set
the parameter minMergeDocs to a higher value, e.g. 1000. this will create
segments of 1000 nodes, and will be more efficient.
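concretely, minMergeDocs is set as a parameter on the SearchIndex element in the workspace config. a sketch (the class name is standard, the path value is just an example):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- where the lucene index files are stored -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- buffer 1000 nodes per segment before writing, better for bulk loads -->
  <param name="minMergeDocs" value="1000"/>
</SearchIndex>
```

after the bulk load you can set it back to a lower value for normal operation.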
(3) will multiple workspaces help?
IMO this might help, if you run into scalability issues with the
persistence manager you are using.
(4) any other things to watch for?
use separate disks for the index and workspace data.
My host has 4GB ram and a few TB diskspace.
Also, any doc describing all possible elements in repository.xml?
the sample repository.xml file in src/conf contains an inline dtd that
contains some documentation.
And if SearchIndex can be turned off?
yes, this is possible. you simply omit the SearchIndex element in the
configuration. though, I would be very interested to see how well the
index works with your data.
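to illustrate, a workspace configured without search might look like this (persistence manager shown only as an example, use whatever you have configured):

```xml
<Workspace name="${wsp.name}">
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${wsp.home}"/>
  </FileSystem>
  <PersistenceManager class="org.apache.jackrabbit.core.persistence.db.DerbyPersistenceManager">
    <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  </PersistenceManager>
  <!-- SearchIndex element omitted: no lucene index is maintained
       and queries on this workspace are unavailable -->
</Workspace>
```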
regards
marcel