Some performance numbers

Java Version: 1.3_01
OS Version: Windows 2000
CPU (Type, Speed and Quantity): Pentium 4, 1.5 GHz, 1 CPU
RAM: 512 MB
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE (single)
Number of source documents: 103009
Total filesize of source documents: 430MB
Average filesize of source documents (in KB/MB): 4.3KB
Source documents storage location (filesystem, DB, http, etc): Filesystem
File type of source documents: xml
Parser(s) used, if any: Standard Analyzer
Number of Fields per document: 8
Time taken (in ms/s as an average of at least 3 indexing runs): 8387 sec (139 min)
Time taken / 1000 docs indexed: 81 sec / 1000 docs
Notes (any special tuning/strategies): I convert each document to a DOM and use XPath to get the fields. I validate the data against certain criteria, such as total size > 150 characters, and verify there are no duplicates using a HashMap. Without these checks, indexing goes faster (about 60 seconds / 1000 docs).
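For anyone who wants to try the same pre-indexing checks, here is a rough sketch of the steps described in the notes: parse to a DOM, pull fields with XPath, reject documents of 150 characters or fewer, and reject duplicates with a HashMap. It is not the code behind the numbers above. The class name DocumentFilter, the element paths (/doc/id, /doc/title, /doc/body), the choice of the id field as the duplicate key, and measuring "total size" as the combined length of title and body are all assumptions made for illustration. It also uses javax.xml.xpath, which ships with later JDKs than the 1.3 release used for the benchmark.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: filters XML documents before they are handed to the indexer.
class DocumentFilter {
    // Threshold taken from the notes: total size must exceed 150 characters.
    static final int MIN_TOTAL_CHARS = 150;

    // Keys of documents already accepted (assumed here to be the id field).
    private final Map<String, Boolean> seen = new HashMap<String, Boolean>();
    private final DocumentBuilder builder;
    private final XPath xpath;

    DocumentFilter() throws Exception {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        xpath = XPathFactory.newInstance().newXPath();
    }

    // Returns the extracted fields, or null if the document is too short
    // or has an id that was already accepted.
    Map<String, String> accept(String xml) throws Exception {
        Document dom = builder.parse(new InputSource(new StringReader(xml)));

        // Element paths are placeholders; adjust them to the real schema.
        String id = xpath.evaluate("/doc/id", dom);
        String title = xpath.evaluate("/doc/title", dom);
        String body = xpath.evaluate("/doc/body", dom);

        // Size check first, so rejected documents are never recorded as seen.
        if (title.length() + body.length() <= MIN_TOTAL_CHARS) {
            return null;
        }
        // put() returns the previous value, so non-null means a duplicate.
        if (seen.put(id, Boolean.TRUE) != null) {
            return null;
        }

        Map<String, String> fields = new HashMap<String, String>();
        fields.put("id", id);
        fields.put("title", title);
        fields.put("body", body);
        return fields;
    }
}
```

A caller would create one DocumentFilter per indexing run, pass each file's contents to accept(), and add a Lucene document only when the result is non-null. The gap reported above between 81 and roughly 60 seconds per 1000 docs gives a feel for what checks of this kind cost.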
I hope this is helpful.

--Peter
