I'm trying to index 10 million small XML-like documents, extracted from an Oracle database.
Lucene version is 1.2, and I'm running RedHat 7.0 Advanced Server on an AMD XP1800+ with 1GB RAM and 46GB + 120GB hard disks. The database is on a separate machine, accessed via the Oracle thin JDBC driver.
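For context, the extraction side is plain JDBC, along these lines (the connection string, table, and column names here are placeholders, not my real schema):

import java.sql.*;

Class.forName("oracle.jdbc.driver.OracleDriver");
Connection conn = DriverManager.getConnection(
    "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
Statement stmt = conn.createStatement();
// Stream rows in batches rather than one network round trip per row.
stmt.setFetchSize(500);
ResultSet rs = stmt.executeQuery(
    "SELECT doc_key, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10 FROM docs");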
Each document consists of around 10 fields of the form
<key>0001234567</key> <field1>some text</field1> <field2>more text</field2> ... <field10> again</field10>
Each document is typically only around 2K in size. Each field is free-text indexed, but only the "key" field is stored.
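In Lucene terms, each record becomes a Document roughly like this (a simplified sketch; the names are illustrative, and the full code is at the URL below):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Build one Lucene Document per database record. Only "key" is stored,
// so hits can be mapped back to Oracle; the other fields are tokenized
// and indexed but not kept in the index.
static Document makeDocument(String key, String[] fieldValues) {
    Document doc = new Document();
    doc.add(Field.Text("key", key));  // indexed, tokenized, and stored
    for (int i = 0; i < fieldValues.length; i++) {
        doc.add(Field.UnStored("field" + (i + 1), fieldValues[i]));
    }
    return doc;
}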
After experimenting, I've set the Java heap to 750MB, set writer.mergeFactor = 10000, and run an optimize() every 50,000 documents to try to keep down the number of open files.
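In outline, the writer setup and indexing loop look like this (the index path is a placeholder, and this reuses the ResultSet and makeDocument from the sketches above; again, the full code is at the URL below):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), true);
writer.mergeFactor = 10000;   // a public field in 1.2; the default is 10

int count = 0;
while (rs.next()) {
    String[] values = new String[10];
    for (int i = 0; i < 10; i++) values[i] = rs.getString(i + 2);
    writer.addDocument(makeDocument(rs.getString(1), values));
    if (++count % 50000 == 0) {
        writer.optimize();    // collapse segments to limit simultaneous open files
    }
}
writer.optimize();
writer.close();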
So far, the furthest I've managed to get is 3 million docs, which used:

- 22,913 simultaneous open files (the maximum on Windows is 2035!)
- up to 79 GB of disk space
- around 18 hours of indexing time
Is what I'm trying to do realistic? Does anyone else have experience of indexing collections this large?
My indexing code, which is based on the simple file demo, can be seen at http://www.rogerford.info/docs/IndexDBData_java.txt
- Many thanks,
Roger.