Please see attached for a diff to benchmarks.xml adding Daniel's numbers. Thanks Dan!
Regards, Kelvin -------- The book giving manifesto - http://how.to/sharethisbook
cvs -z9 diff benchmarks.xml (in directory C:\checkout\jakarta-lucene\xdocs\)
Index: benchmarks.xml
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/xdocs/benchmarks.xml,v
retrieving revision 1.1
diff -r1.1 benchmarks.xml
278a279,344
> <subsection name="Daniel Armbrust's benchmarks">
> <p>
> My disclaimer is that this is a very poor "benchmark". It was not done for raw speed,
> nor was the total index built in one shot. The index was created on several different
> machines (all with these specs, or very similar), with each machine indexing batches of
> 500,000 to 1 million documents. Each of these small indexes was then moved to a
> much larger drive, where they were all merged into one big index.
> This process was done manually, over the course of several months, as the sources became available.
> </p>
> <ul>
> <p>
> <b>Hardware Environment</b><br/>
> <li><i>Dedicated machine for indexing</i>: no - the machine had moderate
> to low load. However, the indexing process was single-threaded, so it only took
> advantage of one of the four processors. It usually got 100% of that processor.</li>
> <li><i>CPU</i>: Sun Ultra 80, 4 x 64-bit processors</li>
> <li><i>RAM</i>: 4 GB</li>
> <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36 GB drive</li>
> </p>
> <p>
> <b>Software environment</b><br/>
> <li><i>Java Version</i>: 1.3.1</li>
> <li><i>Java VM</i>: </li>
> <li><i>OS Version</i>: SunOS 5.8 (64-bit)</li>
> <li><i>Location of index</i>: local</li>
> </p>
> <p>
> <b>Lucene indexing variables</b><br/>
> <li><i>Number of source documents</i>: 13,820,517</li>
> <li><i>Total filesize of source documents</i>: 87.3 GB</li>
> <li><i>Average filesize of source documents</i>: 6.3 KB</li>
> <li><i>Source documents storage location</i>: Filesystem</li>
> <li><i>File type of source documents</i>: XML</li>
> <li><i>Parser(s) used, if any</i>: </li>
> <li><i>Analyzer(s) used</i>: a home-grown analyzer that simply removes stopwords</li>
> <li><i>Number of fields per document</i>: 1 - 31</li>
> <li><i>Type of fields</i>: all text, though two of them are dates (e.g. 20001205) that we filter on</li>
> <li><i>Index persistence</i>: FSDirectory</li>
> <li><i>Index size</i>: 12.5 GB</li>
> </p>
> <p>
> <b>Figures</b><br/>
> <li><i>Time taken (in ms/s as an average of at least 3
> indexing runs)</i>: for 617,271 documents, 209,698 seconds (~2.5 days)</li>
> <li><i>Time taken / 1000 docs indexed</i>: 340 seconds</li>
> <li><i>Memory consumption</i>: java was executed with -Xmx1000m -Xss8192k, so
> 1 GB of memory was allotted to the indexer</li>
> </p>
> <p>
> <b>Notes</b><br/>
> <li><i>Notes</i>:
> <p>
> The source documents were XML. The "indexer" opened each document one at a time, ran an
> XSL transformation on it, and then indexed the resulting stream. The indexer optimized
> the index every 50,000 documents (on this run), though previously we optimized every
> 300,000 documents. The performance didn't change much either way. We did no other
> tuning (RAMDirectories, a separate process to pre-transform the source material, etc.)
> to make it index faster. When all of these individual indexes were built, they were
> merged together into the main index. That process usually took about a day.
> </p></li>
> </p>
> </ul>
> <p>
> Daniel can be contacted at Armbrust.Daniel at mayo.edu.
> </p>
> </subsection>
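For anyone curious, here's a rough sketch of what the batch-index-and-merge workflow Daniel describes might look like against the Lucene 1.x API of the period. This is not Daniel's actual code: the "contents" field name, the StopAnalyzer standing in for his home-grown stopword analyzer, and the JAXP transform step are all my assumptions.

    // Sketch only: illustrates the described workflow, not Daniel's real indexer.
    import java.io.File;
    import java.io.StringWriter;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BatchIndexer {

        /** Index one batch of XML files into its own small index. */
        public static void indexBatch(File[] xmlFiles, File xslFile, String indexPath)
                throws Exception {
            // StopAnalyzer is an assumption; Daniel used a home-grown stopword analyzer.
            Analyzer analyzer = new StopAnalyzer();
            IndexWriter writer = new IndexWriter(indexPath, analyzer, true);
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer(new StreamSource(xslFile));

            for (int i = 0; i < xmlFiles.length; i++) {
                // Open each XML document, run the XSL transformation, index the result.
                StringWriter out = new StringWriter();
                transformer.transform(new StreamSource(xmlFiles[i]), new StreamResult(out));

                Document doc = new Document();
                doc.add(Field.Text("contents", out.toString()));
                writer.addDocument(doc);

                // Optimize periodically (every 50,000 documents on the run described).
                if ((i + 1) % 50000 == 0) {
                    writer.optimize();
                }
            }
            writer.optimize();
            writer.close();
        }

        /** Merge the small per-batch indexes into the main index. */
        public static void mergeBatches(String[] batchPaths, String mergedPath)
                throws Exception {
            IndexWriter writer = new IndexWriter(mergedPath, new StopAnalyzer(), true);
            Directory[] dirs = new Directory[batchPaths.length];
            for (int i = 0; i < batchPaths.length; i++) {
                dirs[i] = FSDirectory.getDirectory(batchPaths[i], false);
            }
            // Merges (and optimizes) all batch indexes into one; this is the
            // step that reportedly took about a day on the full index.
            writer.addIndexes(dirs);
            writer.close();
        }
    }

The single-threaded loop matches Daniel's note that only one of the four processors was used; running indexBatch per machine and mergeBatches once at the end mirrors the manual process he describes.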