otis        2002/12/11 22:23:48

  Modified:    docs     benchmarks.html contributions.html demo.html
                        demo2.html demo3.html demo4.html fileformats.html
                        gettingstarted.html index.html luceneplan.html
                        powered.html queryparsersyntax.html resources.html
                        todo.html whoweare.html
               docs/lucene-sandbox index.html
               docs/lucene-sandbox/indyo tutorial.html
               docs/lucene-sandbox/larm overview.html
               xdocs    benchmarks.xml
  Log:
  - Modified docs.

  Revision  Changes    Path
  1.3       +324 -248  jakarta-lucene/docs/benchmarks.html

Index: benchmarks.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/benchmarks.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- benchmarks.html 4 Dec 2002 05:56:32 -0000 1.2
+++ benchmarks.html 12 Dec 2002 06:23:47 -0000 1.3
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>
@@ -121,20 +122,20 @@
 <tr><td>
 <blockquote>
 <p>
- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project is similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- </p>
+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project is similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+ </p>
 <p>
- If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- </p>
+ If you've conducted performance tests with Lucene, we'd
+ appreciate if you can submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ <a href="benchmarktemplate.xml">template</a>.
+ </p>
 </blockquote>
 </p>
 </td></tr>
@@ -149,64 +150,64 @@
 <tr><td>
 <blockquote>
 <p>
- <ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
- <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
- <li><i>RAM</i>: Self-explanatory</li>
- <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
- <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
- <li><i>OS Version</i>: Self-explanatory</li>
- <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
- <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Source documents storage location</i>: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
- <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
- <li><i>Type of fields</i>: Type of each field</li>
- <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3 indexing
-runs)</i>: Time taken to index all files</li>
- <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
-1000 files</li>
- <li><i>Memory consumption</i>: Self-explanatory</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
- </p>
- </ul>
- </p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: Self-explanatory
+ (yes/no)</li>
+ <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+ <li><i>RAM</i>: Self-explanatory</li>
+ <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
+ RAID-1, RAID-5)</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: Version of Java SDK/JRE that is run
+ </li>
+ <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+ <li><i>OS Version</i>: Self-explanatory</li>
+ <li><i>Location of index</i>: Is the index stored in filesystem
+ or database? Is it on the same server(local) or
+ over the network?</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: Number of documents being
+ indexed</li>
+ <li><i>Total filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Average filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Source documents storage location</i>: Where are the
+ documents being indexed located?
+ Filesystem, DB, http,etc</li>
+ <li><i>File type of source documents</i>: Types of files being
+ indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+ <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
+ various files for indexing,
+ e.g. XML parser, HTML parser, etc.</li>
+ <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+ <li><i>Number of fields per document</i>: Number of Fields each
+ Document contains</li>
+ <li><i>Type of fields</i>: Type of each field</li>
+ <li><i>Index persistence</i>: Where the index is stored, e.g.
+ FSDirectory, SqlDirectory, etc</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3 indexing
+ runs)</i>: Time taken to index all files</li>
+ <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
+ 1000 files</li>
+ <li><i>Memory consumption</i>: Self-explanatory</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>: Any comments which don't belong in the above,
+ special tuning/strategies, etc</li>
+ </p>
+ </ul>
+ </p>
 </blockquote>
 </p>
 </td></tr>
@@ -221,17 +222,17 @@
 <tr><td>
 <blockquote>
 <p>
- These benchmarks have been kindly submitted by Lucene users for
-reference purposes.
- </p>
- <p><b>We make NO guarantees regarding their accuracy or
-validity.</b>
- </p>
- <p>We strongly recommend you conduct your own
- performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit
- these figures to us).
- </p>
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+ </p>
+ <p><b>We make NO guarantees regarding their accuracy or
+ validity.</b>
+ </p>
+ <p>We strongly recommend you conduct your own
+ performance benchmarks before deciding on a particular
+ hardware/software setup (and hopefully submit
+ these figures to us).
+ </p>
 <table border="0" cellspacing="0" cellpadding="2" width="100%">
 <tr><td bgcolor="#828DA6">
 <font color="#ffffff" face="arial,helvetica,sanserif">
@@ -241,109 +242,109 @@
 <tr><td>
 <blockquote>
 <ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: yes</li>
- <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
- <li><i>RAM</i>: 512 DDR</li>
- <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
- <li><i>Total filesize of source documents</i>: > 1GB if
-stored</li>
- <li><i>Average filesize of source documents</i>: 1KB</li>
- <li><i>Source documents storage location</i>: Filesystem</li>
- <li><i>File type of source documents</i>: Generated</li>
- <li><i>Parser(s) used, if any</i>: </li>
- <li><i>Analyzer(s) used</i>: Default</li>
- <li><i>Number of fields per document</i>: 11</li>
- <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: </li>
- <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
- <li><i>Memory consumption</i>:</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>:
- <p>
- A windows client ran a random document generator which
-created
- documents based on some arrays of values and an excerpt
-(approx 1kb)
- from a text file of the bible (King James version).<br />
- These were submitted via a socket connection (open throughout
- indexing process).<br />
- The index writer was not closed between index calls.<br />
- This created a 400Mb index in 23 files (after
-optimization).<br />
- </p>
- <p>
- <u>Query details</u>:<br />
- </p>
- <p>
- Set up a threaded class to start x number of simultaneous
-threads to
- search the above created index.
- </p>
- <p>
- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
- ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
- +DisplayStartDate:[mkwsw2jk0
--mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
- </p>
- <p>
- This query counted 34000 documents and I limited the returned
-documents
- to 5.
- </p>
- <p>
- This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
- be a singleton returned cached searchers for a given
-directory. This
- solved an initial problem with too many files open and
-running out of
- linux handles for them.
- </p>
- <pre>
- Threads|Avg Time per query (ms)
- 1 1009ms
- 2 2043ms
- 3 3087ms
- 4 4045ms
- .. .
- .. .
- 10 10091ms
- </pre>
- <p>
- I removed the two date range terms from the query and it made
-a HUGE
- difference in performance. With 4 threads the avg time
-dropped to 900ms!
- </p>
- <p>Other query optimizations made little difference.</p></li>
- </p>
- </ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: yes</li>
+ <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
+ <li><i>RAM</i>: 512 DDR</li>
+ <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: Random generator. Set
+ to make 1M documents
+ in 2x500,000 batches.</li>
+ <li><i>Total filesize of source documents</i>: > 1GB if
+ stored</li>
+ <li><i>Average filesize of source documents</i>: 1KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: Generated</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: Default</li>
+ <li><i>Number of fields per document</i>: 11</li>
+ <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: </li>
+ <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+ <li><i>Memory consumption</i>:</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ A windows client ran a random document generator which
+ created
+ documents based on some arrays of values and an excerpt
+ (approx 1kb)
+ from a text file of the bible (King James version).<br />
+ These were submitted via a socket connection (open throughout
+ indexing process).<br />
+ The index writer was not closed between index calls.<br />
+ This created a 400Mb index in 23 files (after
+ optimization).<br />
+ </p>
+ <p>
+ <u>Query details</u>:<br />
+ </p>
+ <p>
+ Set up a threaded class to start x number of simultaneous
+ threads to
+ search the above created index.
+ </p>
+ <p>
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+ (Teaser:goo* Tea
+ ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+ +DisplayStartDate:[mkwsw2jk0
+ -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
+ </p>
+ <p>
+ This query counted 34000 documents and I limited the returned
+ documents
+ to 5.
+ </p>
+ <p>
+ This is using Peter Halacsy's IndexSearcherCache slightly
+ modified to
+ be a singleton returned cached searchers for a given
+ directory. This
+ solved an initial problem with too many files open and
+ running out of
+ linux handles for them.
+ </p>
+ <pre>
+ Threads|Avg Time per query (ms)
+ 1 1009ms
+ 2 2043ms
+ 3 3087ms
+ 4 4045ms
+ .. .
+ .. .
+ 10 10091ms
+ </pre>
+ <p>
+ I removed the two date range terms from the query and it made
+ a HUGE
+ difference in performance. With 4 threads the avg time
+ dropped to 900ms!
+ </p>
+ <p>Other query optimizations made little difference.</p></li>
+ </p>
+ </ul>
 <p>
- Hamish can be contacted at hamish at catalyst.net.nz.
- </p>
+ Hamish can be contacted at hamish at catalyst.net.nz.
+ </p>
 </blockquote>
 </td></tr>
 <tr><td><br/></td></tr>
@@ -357,71 +358,146 @@
 <tr><td>
 <blockquote>
 <ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: No, but nominal
-usage at time of indexing.</li>
- <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
- <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
- <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
-Array</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: 1.3.1_06</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Winnt 4/Sp6</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: about 60K</li>
- <li><i>Total filesize of source documents</i>: 6.5GB</li>
- <li><i>Average filesize of source documents</i>: 100K
-(6.5GB/60K documents)</li>
- <li><i>Source documents storage location</i>: filesystem on
-NTFS</li>
- <li><i>File type of source documents</i>: </li>
- <li><i>Parser(s) used, if any</i>: Currently the only parser
-used is the Quiotix html
- parser.</li>
- <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
- <li><i>Number of fields per document</i>: 8</li>
- <li><i>Type of fields</i>: All strings, and all are stored
-and indexed.</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
-minutes. Note that the #
- and size of documents changes daily.</li>
- <li><i>Time taken / 1000 docs indexed</i>: </li>
- <li><i>Memory consumption</i>: JVM is given 256MB and uses it
-all.</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>:
- <p>
- We have 10 threads reading files from the filesystem and
-parsing and
- analyzing them and the pushing them onto a queue and a single
-thread poping
- them from the queue and indexing. Note that we are indexing
-email messages
- and are storing the entire plaintext in of the message in the
-index. If the
- message contains attachment and we do not have a filter for
-the attachment
- (ie. we do not do PDFs yet), we discard the data.
- </p></li>
- </p>
- </ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: No, but nominal
+ usage at time of indexing.</li>
+ <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+ <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+ <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
+ Array</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1_06</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Winnt 4/Sp6</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: about 60K</li>
+ <li><i>Total filesize of source documents</i>: 6.5GB</li>
+ <li><i>Average filesize of source documents</i>: 100K
+ (6.5GB/60K documents)</li>
+ <li><i>Source documents storage location</i>: filesystem on
+ NTFS</li>
+ <li><i>File type of source documents</i>: </li>
+ <li><i>Parser(s) used, if any</i>: Currently the only parser
+ used is the Quiotix html
+ parser.</li>
+ <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+ <li><i>Number of fields per document</i>: 8</li>
+ <li><i>Type of fields</i>: All strings, and all are stored
+ and indexed.</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+ minutes. Note that the #
+ and size of documents changes daily.</li>
+ <li><i>Time taken / 1000 docs indexed</i>: </li>
+ <li><i>Memory consumption</i>: JVM is given 256MB and uses it
+ all.</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ We have 10 threads reading files from the filesystem and
+ parsing and
+ analyzing them and the pushing them onto a queue and a single
+ thread poping
+ them from the queue and indexing. Note that we are indexing
+ email messages
+ and are storing the entire plaintext in of the message in the
+ index. If the
+ message contains attachment and we do not have a filter for
+ the attachment
+ (ie. we do not do PDFs yet), we discard the data.
+ </p></li>
+ </p>
+ </ul>
+ <p>
+ Justin can be contacted at tvxh-lw4x at spamex.com.
+ </p>
+ </blockquote>
+ </td></tr>
+ <tr><td><br/></td></tr>
+ </table>
+ <table border="0" cellspacing="0" cellpadding="2" width="100%">
+ <tr><td bgcolor="#828DA6">
+ <font color="#ffffff" face="arial,helvetica,sanserif">
+ <a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
+ </font>
+ </td></tr>
+ <tr><td>
+ <blockquote>
+ <p>
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+ 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+ </p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load. However, the indexing process was built single
+ threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
+ <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+ <li><i>RAM</i>: 4 GB Memory</li>
+ <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: 13,820,517</li>
+ <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+ <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: XML</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+ <li><i>Number of fields per document</i>: 1 - 31</li>
+ <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ <li><i>Index size</i>: 12.5 GB</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+ <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+ <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+ 1 GB of memory was allotted to the indexer</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on them, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run) though previously, we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took ~ a day.
+ </p></li>
+ </p>
+ </ul>
 <p>
- Justin can be contacted at tvxh-lw4x at spamex.com.
- </p>
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ </p>
 </blockquote>
 </td></tr>
 <tr><td><br/></td></tr>

  1.17      +1 -0      jakarta-lucene/docs/contributions.html

Index: contributions.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/contributions.html,v
retrieving revision 1.16
retrieving revision 1.17
diff -u -r1.16 -r1.17
--- contributions.html 4 Dec 2002 05:56:32 -0000 1.16
+++ contributions.html 12 Dec 2002 06:23:47 -0000 1.17
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.13      +1 -0      jakarta-lucene/docs/demo.html

Index: demo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.13      +1 -0      jakarta-lucene/docs/demo2.html

Index: demo2.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo2.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo2.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo2.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.15      +1 -0      jakarta-lucene/docs/demo3.html

Index: demo3.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo3.html,v
retrieving revision 1.14
retrieving revision 1.15
diff -u -r1.14 -r1.15
--- demo3.html 4 Dec 2002 05:56:32 -0000 1.14
+++ demo3.html 12 Dec 2002 06:23:47 -0000 1.15
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.13      +1 -0      jakarta-lucene/docs/demo4.html

Index: demo4.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo4.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo4.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo4.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.6       +1 -0      jakarta-lucene/docs/fileformats.html

Index: fileformats.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- fileformats.html 4 Dec 2002 05:56:32 -0000 1.5
+++ fileformats.html 12 Dec 2002 06:23:47 -0000 1.6
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.13      +1 -0      jakarta-lucene/docs/gettingstarted.html

Index: gettingstarted.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/gettingstarted.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- gettingstarted.html 4 Dec 2002 05:56:32 -0000 1.12
+++ gettingstarted.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.24      +1 -0      jakarta-lucene/docs/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/index.html,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -r1.23 -r1.24
--- index.html 4 Dec 2002 05:56:32 -0000 1.23
+++ index.html 12 Dec 2002 06:23:47 -0000 1.24
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.14      +1 -0      jakarta-lucene/docs/luceneplan.html

Index: luceneplan.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/luceneplan.html,v
retrieving revision 1.13
retrieving revision 1.14
diff -u -r1.13 -r1.14
--- luceneplan.html 4 Dec 2002 05:56:32 -0000 1.13
+++ luceneplan.html 12 Dec 2002 06:23:47 -0000 1.14
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.22      +1 -0      jakarta-lucene/docs/powered.html

Index: powered.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/powered.html,v
retrieving revision 1.21
retrieving revision 1.22
diff -u -r1.21 -r1.22
--- powered.html 4 Dec 2002 05:56:32 -0000 1.21
+++ powered.html 12 Dec 2002 06:23:47 -0000 1.22
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.12      +1 -0      jakarta-lucene/docs/queryparsersyntax.html

Index: queryparsersyntax.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/queryparsersyntax.html,v
retrieving revision 1.11
retrieving revision 1.12
diff -u -r1.11 -r1.12
--- queryparsersyntax.html 4 Dec 2002 05:56:32 -0000 1.11
+++ queryparsersyntax.html 12 Dec 2002 06:23:47 -0000 1.12
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.20      +1 -0      jakarta-lucene/docs/resources.html

Index: resources.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/resources.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- resources.html 4 Dec 2002 05:56:32 -0000 1.19
+++ resources.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.4       +1 -0      jakarta-lucene/docs/todo.html

Index: todo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/todo.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- todo.html 4 Dec 2002 05:56:32 -0000 1.3
+++ todo.html 12 Dec 2002 06:23:47 -0000 1.4
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.20      +1 -0      jakarta-lucene/docs/whoweare.html

Index: whoweare.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- whoweare.html 4 Dec 2002 05:56:32 -0000 1.19
+++ whoweare.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.8       +1 -0      jakarta-lucene/docs/lucene-sandbox/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/index.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- index.html 4 Dec 2002 05:56:33 -0000 1.7
+++ index.html 12 Dec 2002 06:23:48 -0000 1.8
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.7       +1 -0      jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html

Index: tutorial.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- tutorial.html 4 Dec 2002 05:56:33 -0000 1.6
+++ tutorial.html 12 Dec 2002 06:23:48 -0000 1.7
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.6       +1 -0      jakarta-lucene/docs/lucene-sandbox/larm/overview.html

Index: overview.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/larm/overview.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- overview.html 4 Dec 2002 05:56:33 -0000 1.5
+++ overview.html 12 Dec 2002 06:23:48 -0000 1.6
@@ -5,6 +5,7 @@
 <!-- start the processing -->
 <!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
 <!-- Main Page Section -->
 <!-- ====================================================================== -->
 <html>

  1.2       +337 -271  jakarta-lucene/xdocs/benchmarks.xml

Index: benchmarks.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/xdocs/benchmarks.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- benchmarks.xml 4 Dec 2002 05:46:43 -0000 1.1
+++ benchmarks.xml 12 Dec 2002 06:23:48 -0000 1.2
@@ -1,283 +1,349 @@
 <?xml version="1.0"?>
 <document>
 <properties>
- <author email="[EMAIL PROTECTED]">Kelvin Tan</author>
- <title>Resources - Performance Benchmarks</title>
+ <author email="[EMAIL PROTECTED]">Kelvin Tan</author>
+ <title>Resources - Performance Benchmarks</title>
 </properties>
 <body>
- <section name="Performance Benchmarks">
- <p>
- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project is similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- </p>
- <p>
- If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- </p>
- </section>
-
- <section name="Benchmark Variables">
- <p>
- <ul>
- <p>
- <b>Hardware Environment</b><br/>
- <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
- <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
- <li><i>RAM</i>: Self-explanatory</li>
- <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- </p>
- <p>
- <b>Software environment</b><br/>
- <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
- <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
- <li><i>OS Version</i>: Self-explanatory</li>
- <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br/>
- <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
- <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Source documents storage location</i>: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
- <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
- <li><i>Type of fields</i>: Type of each field</li>
- <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li> - </p> - <p> - <b>Figures</b><br/> - <li><i>Time taken (in ms/s as an average of at least 3 indexing -runs)</i>: Time taken to index all files</li> - <li><i>Time taken / 1000 docs indexed</i>: Time taken to index -1000 files</li> - <li><i>Memory consumption</i>: Self-explanatory</li> - </p> - <p> - <b>Notes</b><br/> - <li><i>Notes</i>: Any comments which don't belong in the above, -special tuning/strategies, etc</li> - </p> - </ul> - </p> - </section> + <section name="Performance Benchmarks"> + <p> + The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project is similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. + </p> + <p> + If you've conducted performance tests with Lucene, we'd + appreciate if you can submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + <a href="benchmarktemplate.xml">template</a>. + </p> + </section> - <section name="User-submitted Benchmarks"> - <p> - These benchmarks have been kindly submitted by Lucene users for -reference purposes. - </p> - <p><b>We make NO guarantees regarding their accuracy or -validity.</b> - </p> - <p>We strongly recommend you conduct your own - performance benchmarks before deciding on a particular -hardware/software setup (and hopefully submit - these figures to us). 
- </p> - - <subsection name="Hamish Carpenter's benchmarks"> - <ul> - <p> - <b>Hardware Environment</b><br/> - <li><i>Dedicated machine for indexing</i>: yes</li> - <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li> - <li><i>RAM</i>: 512 DDR</li> - <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li> - </p> - <p> - <b>Software environment</b><br/> - <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li> - <li><i>Java VM</i>: </li> - <li><i>OS Version</i>: Debian Linux 2.4.18-686</li> - <li><i>Location of index</i>: local</li> - </p> - <p> - <b>Lucene indexing variables</b><br/> - <li><i>Number of source documents</i>: Random generator. Set -to make 1M documents -in 2x500,000 batches.</li> - <li><i>Total filesize of source documents</i>: > 1GB if -stored</li> - <li><i>Average filesize of source documents</i>: 1KB</li> - <li><i>Source documents storage location</i>: Filesystem</li> - <li><i>File type of source documents</i>: Generated</li> - <li><i>Parser(s) used, if any</i>: </li> - <li><i>Analyzer(s) used</i>: Default</li> - <li><i>Number of fields per document</i>: 11</li> - <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li> - <li><i>Index persistence</i>: FSDirectory</li> - </p> - <p> - <b>Figures</b><br/> - <li><i>Time taken (in ms/s as an average of at least 3 -indexing runs)</i>: </li> - <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li> - <li><i>Memory consumption</i>:</li> - </p> - <p> - <b>Notes</b><br/> - <li><i>Notes</i>: - <p> - A windows client ran a random document generator which -created - documents based on some arrays of values and an excerpt -(approx 1kb) - from a text file of the bible (King James version).<br/> - These were submitted via a socket connection (open throughout - indexing process).<br/> - The index writer was not closed between index calls.<br/> - This created a 400Mb index in 23 files (after -optimization).<br/> - </p> - <p> - <u>Query details</u>:<br/> - </p> - <p> - Set up a threaded class to start x number of simultaneous 
-threads to - search the above created index. - </p> - <p> - Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) -(Teaser:goo* Tea - ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) - +DisplayStartDate:[mkwsw2jk0 - -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] - </p> - <p> - This query counted 34000 documents and I limited the returned -documents - to 5. - </p> - <p> - This is using Peter Halacsy's IndexSearcherCache slightly -modified to - be a singleton returned cached searchers for a given -directory. This - solved an initial problem with too many files open and -running out of - linux handles for them. - </p> - <pre> - Threads|Avg Time per query (ms) - 1 1009ms - 2 2043ms - 3 3087ms - 4 4045ms - .. . - .. . - 10 10091ms - </pre> - <p> - I removed the two date range terms from the query and it made -a HUGE - difference in performance. With 4 threads the avg time -dropped to 900ms! - </p> - <p>Other query optimizations made little difference.</p></li> - </p> - </ul> - <p> - Hamish can be contacted at hamish at catalyst.net.nz. - </p> - </subsection> + <section name="Benchmark Variables"> + <p> + <ul> + <p> + <b>Hardware Environment</b><br/> + <li><i>Dedicated machine for indexing</i>: Self-explanatory + (yes/no)</li> + <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li> + <li><i>RAM</i>: Self-explanatory</li> + <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, + RAID-1, RAID-5)</li> + </p> + <p> + <b>Software environment</b><br/> + <li><i>Java Version</i>: Version of Java SDK/JRE that is run + </li> + <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li> + <li><i>OS Version</i>: Self-explanatory</li> + <li><i>Location of index</i>: Is the index stored in filesystem + or database? 
Is it on the same server(local) or + over the network?</li> + </p> + <p> + <b>Lucene indexing variables</b><br/> + <li><i>Number of source documents</i>: Number of documents being + indexed</li> + <li><i>Total filesize of source documents</i>: + Self-explanatory</li> + <li><i>Average filesize of source documents</i>: + Self-explanatory</li> + <li><i>Source documents storage location</i>: Where are the + documents being indexed located? + Filesystem, DB, http,etc</li> + <li><i>File type of source documents</i>: Types of files being + indexed, e.g. HTML files, XML files, PDF files, etc.</li> + <li><i>Parser(s) used, if any</i>: Parsers used for parsing the + various files for indexing, + e.g. XML parser, HTML parser, etc.</li> + <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li> + <li><i>Number of fields per document</i>: Number of Fields each + Document contains</li> + <li><i>Type of fields</i>: Type of each field</li> + <li><i>Index persistence</i>: Where the index is stored, e.g. 
+ FSDirectory, SqlDirectory, etc</li> + </p> + <p> + <b>Figures</b><br/> + <li><i>Time taken (in ms/s as an average of at least 3 indexing + runs)</i>: Time taken to index all files</li> + <li><i>Time taken / 1000 docs indexed</i>: Time taken to index + 1000 files</li> + <li><i>Memory consumption</i>: Self-explanatory</li> + </p> + <p> + <b>Notes</b><br/> + <li><i>Notes</i>: Any comments which don't belong in the above, + special tuning/strategies, etc</li> + </p> + </ul> + </p> + </section> - <subsection name="Justin Greene's benchmarks"> - <ul> - <p> - <b>Hardware Environment</b><br/> - <li><i>Dedicated machine for indexing</i>: No, but nominal -usage at time of indexing.</li> - <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li> - <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li> - <li><i>Drive configuration</i>: RAID 5 on Fibre Channel -Array</li> - </p> - <p> - <b>Software environment</b><br/> - <li><i>Java Version</i>: 1.3.1_06</li> - <li><i>Java VM</i>: </li> - <li><i>OS Version</i>: Winnt 4/Sp6</li> - <li><i>Location of index</i>: local</li> - </p> - <p> - <b>Lucene indexing variables</b><br/> - <li><i>Number of source documents</i>: about 60K</li> - <li><i>Total filesize of source documents</i>: 6.5GB</li> - <li><i>Average filesize of source documents</i>: 100K -(6.5GB/60K documents)</li> - <li><i>Source documents storage location</i>: filesystem on -NTFS</li> - <li><i>File type of source documents</i>: </li> - <li><i>Parser(s) used, if any</i>: Currently the only parser -used is the Quiotix html - parser.</li> - <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li> - <li><i>Number of fields per document</i>: 8</li> - <li><i>Type of fields</i>: All strings, and all are stored -and indexed.</li> - <li><i>Index persistence</i>: FSDirectory</li> - </p> - <p> - <b>Figures</b><br/> - <li><i>Time taken (in ms/s as an average of at least 3 -indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 -minutes. 
Note that the # - and size of documents changes daily.</li> - <li><i>Time taken / 1000 docs indexed</i>: </li> - <li><i>Memory consumption</i>: JVM is given 256MB and uses it -all.</li> - </p> - <p> - <b>Notes</b><br/> - <li><i>Notes</i>: - <p> - We have 10 threads reading files from the filesystem and -parsing and - analyzing them and the pushing them onto a queue and a single -thread poping - them from the queue and indexing. Note that we are indexing -email messages - and are storing the entire plaintext in of the message in the -index. If the - message contains attachment and we do not have a filter for -the attachment - (ie. we do not do PDFs yet), we discard the data. - </p></li> - </p> - </ul> - <p> - Justin can be contacted at tvxh-lw4x at spamex.com. - </p> - </subsection> + <section name="User-submitted Benchmarks"> + <p> + These benchmarks have been kindly submitted by Lucene users for + reference purposes. + </p> + <p><b>We make NO guarantees regarding their accuracy or + validity.</b> + </p> + <p>We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). + </p> - </section> + <subsection name="Hamish Carpenter's benchmarks"> + <ul> + <p> + <b>Hardware Environment</b><br/> + <li><i>Dedicated machine for indexing</i>: yes</li> + <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li> + <li><i>RAM</i>: 512 DDR</li> + <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li> + </p> + <p> + <b>Software environment</b><br/> + <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li> + <li><i>Java VM</i>: </li> + <li><i>OS Version</i>: Debian Linux 2.4.18-686</li> + <li><i>Location of index</i>: local</li> + </p> + <p> + <b>Lucene indexing variables</b><br/> + <li><i>Number of source documents</i>: Random generator. 
Set + to make 1M documents + in 2x500,000 batches.</li> + <li><i>Total filesize of source documents</i>: > 1GB if + stored</li> + <li><i>Average filesize of source documents</i>: 1KB</li> + <li><i>Source documents storage location</i>: Filesystem</li> + <li><i>File type of source documents</i>: Generated</li> + <li><i>Parser(s) used, if any</i>: </li> + <li><i>Analyzer(s) used</i>: Default</li> + <li><i>Number of fields per document</i>: 11</li> + <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li> + <li><i>Index persistence</i>: FSDirectory</li> + </p> + <p> + <b>Figures</b><br/> + <li><i>Time taken (in ms/s as an average of at least 3 + indexing runs)</i>: </li> + <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li> + <li><i>Memory consumption</i>:</li> + </p> + <p> + <b>Notes</b><br/> + <li><i>Notes</i>: + <p> + A windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1kb) + from a text file of the bible (King James version).<br/> + These were submitted via a socket connection (open throughout + indexing process).<br/> + The index writer was not closed between index calls.<br/> + This created a 400Mb index in 23 files (after + optimization).<br/> + </p> + <p> + <u>Query details</u>:<br/> + </p> + <p> + Set up a threaded class to start x number of simultaneous + threads to + search the above created index. + </p> + <p> + Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) + (Teaser:goo* Tea + ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) + +DisplayStartDate:[mkwsw2jk0 + -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] + </p> + <p> + This query counted 34000 documents and I limited the returned + documents + to 5. + </p> + <p> + This is using Peter Halacsy's IndexSearcherCache slightly + modified to + be a singleton returned cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + linux handles for them. 
+ </p> + <pre> + Threads|Avg Time per query (ms) + 1 1009ms + 2 2043ms + 3 3087ms + 4 4045ms + .. . + .. . + 10 10091ms + </pre> + <p> + I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! + </p> + <p>Other query optimizations made little difference.</p></li> + </p> + </ul> + <p> + Hamish can be contacted at hamish at catalyst.net.nz. + </p> + </subsection> + + <subsection name="Justin Greene's benchmarks"> + <ul> + <p> + <b>Hardware Environment</b><br/> + <li><i>Dedicated machine for indexing</i>: No, but nominal + usage at time of indexing.</li> + <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li> + <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li> + <li><i>Drive configuration</i>: RAID 5 on Fibre Channel + Array</li> + </p> + <p> + <b>Software environment</b><br/> + <li><i>Java Version</i>: 1.3.1_06</li> + <li><i>Java VM</i>: </li> + <li><i>OS Version</i>: Winnt 4/Sp6</li> + <li><i>Location of index</i>: local</li> + </p> + <p> + <b>Lucene indexing variables</b><br/> + <li><i>Number of source documents</i>: about 60K</li> + <li><i>Total filesize of source documents</i>: 6.5GB</li> + <li><i>Average filesize of source documents</i>: 100K + (6.5GB/60K documents)</li> + <li><i>Source documents storage location</i>: filesystem on + NTFS</li> + <li><i>File type of source documents</i>: </li> + <li><i>Parser(s) used, if any</i>: Currently the only parser + used is the Quiotix html + parser.</li> + <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li> + <li><i>Number of fields per document</i>: 8</li> + <li><i>Type of fields</i>: All strings, and all are stored + and indexed.</li> + <li><i>Index persistence</i>: FSDirectory</li> + </p> + <p> + <b>Figures</b><br/> + <li><i>Time taken (in ms/s as an average of at least 3 + indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. 
Note that the # + and size of documents changes daily.</li> + <li><i>Time taken / 1000 docs indexed</i>: </li> + <li><i>Memory consumption</i>: JVM is given 256MB and uses it + all.</li> + </p> + <p> + <b>Notes</b><br/> + <li><i>Notes</i>: + <p> + We have 10 threads reading files from the filesystem and + parsing and + analyzing them and the pushing them onto a queue and a single + thread poping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext in of the message in the + index. If the + message contains attachment and we do not have a filter for + the attachment + (ie. we do not do PDFs yet), we discard the data. + </p></li> + </p> + </ul> + <p> + Justin can be contacted at tvxh-lw4x at spamex.com. + </p> + </subsection> + + + <subsection name="Daniel Armbrust's benchmarks"> + <p> + My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, + nor was the total index built in one shot. The index was created on several different + machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to + 1 million documents per batch. Each of these small indexes was then moved to a + much larger drive, where they were all merged together into a big index. + This process was done manually, over the course of several months, as the sources became available. + </p> + <ul> + <p> + <b>Hardware Environment</b><br/> + <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load. However, the indexing process was built single + threaded, so it only took advantage of 1 of the processors. 
It usually got 100% of this processor.</li> + <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li> + <li><i>RAM</i>: 4 GB Memory</li> + <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li> + </p> + <p> + <b>Software environment</b><br/> + <li><i>Java Version</i>: 1.3.1</li> + <li><i>Java VM</i>: </li> + <li><i>OS Version</i>: Sun 5.8 (64 bit)</li> + <li><i>Location of index</i>: local</li> + </p> + <p> + <b>Lucene indexing variables</b><br/> + <li><i>Number of source documents</i>: 13,820,517</li> + <li><i>Total filesize of source documents</i>: 87.3 GB</li> + <li><i>Average filesize of source documents</i>: 6.3 KB</li> + <li><i>Source documents storage location</i>: Filesystem</li> + <li><i>File type of source documents</i>: XML</li> + <li><i>Parser(s) used, if any</i>: </li> + <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li> + <li><i>Number of fields per document</i>: 1 - 31</li> + <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li> + <li><i>Index persistence</i>: FSDirectory</li> + <li><i>Index size</i>: 12.5 GB</li> + </p> + <p> + <b>Figures</b><br/> + <li><i>Time taken (in ms/s as an average of at least 3 + indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li> + <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li> + <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so + 1 GB of memory was allotted to the indexer</li> + </p> + <p> + <b>Notes</b><br/> + <li><i>Notes</i>: + <p> + The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on them, and then proceeded to index the stream. The indexer optimized + the index every 50,000 documents (on this run) though previously, we optimized every + 300,000 documents. The performance didn't change much either way. 
We did no other + tuning (RAM Directories, separate process to pretransform the source material, etc) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took ~ a day. + </p></li> + </p> + </ul> + <p> + Daniel can be contacted at Armbrust.Daniel at mayo.edu. + </p> + </subsection> + + </section> </body> </document> -