otis        2002/12/11 22:23:48

  Modified:    docs     benchmarks.html contributions.html demo.html
                        demo2.html demo3.html demo4.html fileformats.html
                        gettingstarted.html index.html luceneplan.html
                        powered.html queryparsersyntax.html resources.html
                        todo.html whoweare.html
               docs/lucene-sandbox index.html
               docs/lucene-sandbox/indyo tutorial.html
               docs/lucene-sandbox/larm overview.html
               xdocs    benchmarks.xml
  Log:
  - Modified docs.
  
  Revision  Changes    Path
  1.3       +324 -248  jakarta-lucene/docs/benchmarks.html
  
  Index: benchmarks.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/benchmarks.html,v
  retrieving revision 1.2
  retrieving revision 1.3
  diff -u -r1.2 -r1.3
  --- benchmarks.html   4 Dec 2002 05:56:32 -0000       1.2
  +++ benchmarks.html   12 Dec 2002 06:23:47 -0000      1.3
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  @@ -121,20 +122,20 @@
         <tr><td>
           <blockquote>
                                       <p>
  -      The purpose of these user-submitted performance figures is to 
  -give current and potential users of Lucene a sense 
  -      of how well Lucene scales. If the requirements for an upcoming 
  -project is similar to an existing benchmark, you 
  -      will also have something to work with when designing the system 
  -architecture for the application.
  -      </p>
  +                The purpose of these user-submitted performance figures is to
  +                give current and potential users of Lucene a sense
  +                of how well Lucene scales. If the requirements for an upcoming
  +                project is similar to an existing benchmark, you
  +                will also have something to work with when designing the system
  +                architecture for the application.
  +            </p>
                                                   <p>
  -      If you've conducted performance tests with Lucene, we'd 
  -appreciate if you can submit these figures for display 
  -      on this page. Post these figures to the lucene-user mailing list 
  -using this 
  -      <a href="benchmarktemplate.xml">template</a>.
  -      </p>
  +                If you've conducted performance tests with Lucene, we'd
  +                appreciate if you can submit these figures for display
  +                on this page. Post these figures to the lucene-user mailing list
  +                using this
  +                <a href="benchmarktemplate.xml">template</a>.
  +            </p>
                               </blockquote>
           </p>
         </td></tr>
  @@ -149,64 +150,64 @@
         <tr><td>
           <blockquote>
                                       <p>
  -      <ul>
  -      <p>
  -      <b>Hardware Environment</b><br />
  -      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
  -(yes/no)</li>
  -      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
  -      <li><i>RAM</i>: Self-explanatory</li>
  -      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
  -RAID-1, RAID-5)</li>
  -      </p>
  -      <p>
  -      <b>Software environment</b><br />
  -      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
  -</li>
  -      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
  -      <li><i>OS Version</i>: Self-explanatory</li>
  -      <li><i>Location of index</i>: Is the index stored in filesystem 
  -or database? Is it on the same server(local) or 
  -      over the network?</li>
  -      </p>
  -      <p>
  -      <b>Lucene indexing variables</b><br />
  -      <li><i>Number of source documents</i>: Number of documents being 
  -indexed</li>
  -      <li><i>Total filesize of source documents</i>: 
  -Self-explanatory</li>
  -      <li><i>Average filesize of source documents</i>: 
  -Self-explanatory</li>
  -      <li><i>Source documents storage location</i>: Where are the 
  -documents being indexed located? 
  -        Filesystem, DB, http,etc</li>
  -      <li><i>File type of source documents</i>: Types of files being 
  -indexed, e.g. HTML files, XML files, PDF files, etc.</li>
  -      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
  -various files for indexing, 
  -        e.g. XML parser, HTML parser, etc.</li>
  -      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
  -      <li><i>Number of fields per document</i>: Number of Fields each 
  -Document contains</li>
  -      <li><i>Type of fields</i>: Type of each field</li>
  -      <li><i>Index persistence</i>: Where the index is stored, e.g. 
  -FSDirectory, SqlDirectory, etc</li>
  -      </p>
  -      <p>
  -      <b>Figures</b><br />
  -      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
  -runs)</i>: Time taken to index all files</li>
  -      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
  -1000 files</li>
  -      <li><i>Memory consumption</i>: Self-explanatory</li>
  -      </p>
  -      <p>
  -      <b>Notes</b><br />
  -      <li><i>Notes</i>: Any comments which don't belong in the above, 
  -special tuning/strategies, etc</li>
  -      </p>
  -      </ul>
  -      </p>
  +                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br />
  +                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
  +                            (yes/no)</li>
  +                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
  +                        <li><i>RAM</i>: Self-explanatory</li>
  +                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
  +                            RAID-1, RAID-5)</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br />
  +                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
  +                        </li>
  +                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
  +                        <li><i>OS Version</i>: Self-explanatory</li>
  +                        <li><i>Location of index</i>: Is the index stored in filesystem
  +                            or database? Is it on the same server(local) or
  +                            over the network?</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br />
  +                        <li><i>Number of source documents</i>: Number of documents being
  +                            indexed</li>
  +                        <li><i>Total filesize of source documents</i>:
  +                            Self-explanatory</li>
  +                        <li><i>Average filesize of source documents</i>:
  +                            Self-explanatory</li>
  +                        <li><i>Source documents storage location</i>: Where are the
  +                            documents being indexed located?
  +                            Filesystem, DB, http,etc</li>
  +                        <li><i>File type of source documents</i>: Types of files being
  +                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
  +                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
  +                            various files for indexing,
  +                            e.g. XML parser, HTML parser, etc.</li>
  +                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
  +                        <li><i>Number of fields per document</i>: Number of Fields each
  +                            Document contains</li>
  +                        <li><i>Type of fields</i>: Type of each field</li>
  +                        <li><i>Index persistence</i>: Where the index is stored, e.g.
  +                            FSDirectory, SqlDirectory, etc</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br />
  +                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
  +                                runs)</i>: Time taken to index all files</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
  +                            1000 files</li>
  +                        <li><i>Memory consumption</i>: Self-explanatory</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br />
  +                        <li><i>Notes</i>: Any comments which don't belong in the above,
  +                            special tuning/strategies, etc</li>
  +                    </p>
  +                </ul>
  +            </p>
                               </blockquote>
           </p>
         </td></tr>
  @@ -221,17 +222,17 @@
         <tr><td>
           <blockquote>
                                       <p>
  -      These benchmarks have been kindly submitted by Lucene users for 
  -reference purposes. 
  -      </p>
  -                                                <p><b>We make NO guarantees regarding their accuracy or 
  -validity.</b>
  -      </p>
  -                                                <p>We strongly recommend you conduct your own 
  -      performance benchmarks before deciding on a particular 
  -hardware/software setup (and hopefully submit 
  -      these figures to us).
  -      </p>
  +                These benchmarks have been kindly submitted by Lucene users for
  +                reference purposes.
  +            </p>
  +                                                <p><b>We make NO guarantees regarding their accuracy or
  +                    validity.</b>
  +            </p>
  +                                                <p>We strongly recommend you conduct your own
  +                performance benchmarks before deciding on a particular
  +                hardware/software setup (and hopefully submit
  +                these figures to us).
  +            </p>
                                                      <table border="0" cellspacing="0" cellpadding="2" width="100%">
         <tr><td bgcolor="#828DA6">
           <font color="#ffffff" face="arial,helvetica,sanserif">
  @@ -241,109 +242,109 @@
         <tr><td>
           <blockquote>
                                       <ul>
  -          <p>
  -          <b>Hardware Environment</b><br />
  -          <li><i>Dedicated machine for indexing</i>: yes</li>
  -          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
  -          <li><i>RAM</i>: 512 DDR</li>
  -          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
  -          </p>
  -          <p>
  -          <b>Software environment</b><br />
  -          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
  -          <li><i>Java VM</i>: </li>
  -          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
  -          <li><i>Location of index</i>: local</li>
  -          </p>
  -          <p>
  -          <b>Lucene indexing variables</b><br />
  -          <li><i>Number of source documents</i>: Random generator. Set 
  -to make 1M documents
  -in 2x500,000 batches.</li>
  -          <li><i>Total filesize of source documents</i>: &gt; 1GB if 
  -stored</li>
  -          <li><i>Average filesize of source documents</i>: 1KB</li>
  -          <li><i>Source documents storage location</i>: Filesystem</li>
  -          <li><i>File type of source documents</i>: Generated</li>
  -          <li><i>Parser(s) used, if any</i>: </li>
  -          <li><i>Analyzer(s) used</i>: Default</li>
  -          <li><i>Number of fields per document</i>: 11</li>
  -          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
  -          <li><i>Index persistence</i>: FSDirectory</li>
  -          </p>
  -          <p>
  -          <b>Figures</b><br />
  -          <li><i>Time taken (in ms/s as an average of at least 3 
  -indexing runs)</i>: </li>
  -          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
  -          <li><i>Memory consumption</i>:</li>
  -          </p>
  -          <p>
  -          <b>Notes</b><br />
  -          <li><i>Notes</i>: 
  -          <p>
  -          A windows client ran a random document generator which 
  -created
  -          documents based on some arrays of values and an excerpt 
  -(approx 1kb)
  -          from a text file of the bible (King James version).<br />
  -          These were submitted via a socket connection (open throughout
  -          indexing process).<br />
  -          The index writer was not closed between index calls.<br />
  -          This created a 400Mb index in 23 files (after 
  -optimization).<br />
  -          </p>
  -          <p>
  -          <u>Query details</u>:<br />
  -          </p>
  -          <p>
  -          Set up a threaded class to start x number of simultaneous 
  -threads to
  -          search the above created index.
  -          </p>
  -          <p>
  -          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
  -(Teaser:goo* Tea
  -          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
  -          +DisplayStartDate:[mkwsw2jk0
  -          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
  -          </p>
  -          <p>
  -          This query counted 34000 documents and I limited the returned 
  -documents
  -          to 5.
  -          </p>
  -          <p>
  -          This is using Peter Halacsy's IndexSearcherCache slightly 
  -modified to
  -          be a singleton returned cached searchers for a given 
  -directory. This
  -          solved an initial problem with too many files open and 
  -running out of
  -          linux handles for them.
  -          </p>
  -          <pre>
  -          Threads|Avg Time per query (ms)
  -          1       1009ms
  -          2       2043ms
  -          3       3087ms
  -          4       4045ms
  -          ..        .
  -          ..        .
  -          10      10091ms
  -          </pre>
  -          <p>
  -          I removed the two date range terms from the query and it made 
  -a HUGE
  -          difference in performance. With 4 threads the avg time 
  -dropped to 900ms!
  -          </p>
  -          <p>Other query optimizations made little difference.</p></li>
  -          </p>
  -          </ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br />
  +                        <li><i>Dedicated machine for indexing</i>: yes</li>
  +                        <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
  +                        <li><i>RAM</i>: 512 DDR</li>
  +                        <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br />
  +                        <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br />
  +                        <li><i>Number of source documents</i>: Random generator. Set
  +                            to make 1M documents
  +                            in 2x500,000 batches.</li>
  +                        <li><i>Total filesize of source documents</i>: &gt; 1GB if
  +                            stored</li>
  +                        <li><i>Average filesize of source documents</i>: 1KB</li>
  +                        <li><i>Source documents storage location</i>: Filesystem</li>
  +                        <li><i>File type of source documents</i>: Generated</li>
  +                        <li><i>Parser(s) used, if any</i>: </li>
  +                        <li><i>Analyzer(s) used</i>: Default</li>
  +                        <li><i>Number of fields per document</i>: 11</li>
  +                        <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br />
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: </li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
  +                        <li><i>Memory consumption</i>:</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br />
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                A windows client ran a random document generator which
  +                                created
  +                                documents based on some arrays of values and an excerpt
  +                                (approx 1kb)
  +                                from a text file of the bible (King James version).<br />
  +                                These were submitted via a socket connection (open throughout
  +                                indexing process).<br />
  +                                The index writer was not closed between index calls.<br />
  +                                This created a 400Mb index in 23 files (after
  +                                optimization).<br />
  +                            </p>
  +                            <p>
  +                                <u>Query details</u>:<br />
  +                            </p>
  +                            <p>
  +                                Set up a threaded class to start x number of simultaneous
  +                                threads to
  +                                search the above created index.
  +                            </p>
  +                            <p>
  +                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
  +                                (Teaser:goo* Tea
  +                                ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
  +                                +DisplayStartDate:[mkwsw2jk0
  +                                -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
  +                            </p>
  +                            <p>
  +                                This query counted 34000 documents and I limited the returned
  +                                documents
  +                                to 5.
  +                            </p>
  +                            <p>
  +                                This is using Peter Halacsy's IndexSearcherCache slightly
  +                                modified to
  +                                be a singleton returned cached searchers for a given
  +                                directory. This
  +                                solved an initial problem with too many files open and
  +                                running out of
  +                                linux handles for them.
  +                            </p>
  +                            <pre>
  +                                Threads|Avg Time per query (ms)
  +                                1       1009ms
  +                                2       2043ms
  +                                3       3087ms
  +                                4       4045ms
  +                                ..        .
  +                                ..        .
  +                                10      10091ms
  +                            </pre>
  +                            <p>
  +                                I removed the two date range terms from the query and it made
  +                                a HUGE
  +                                difference in performance. With 4 threads the avg time
  +                                dropped to 900ms!
  +                            </p>
  +                            <p>Other query optimizations made little difference.</p></li>
  +                    </p>
  +                </ul>
                                                   <p>
  -          Hamish can be contacted at hamish at catalyst.net.nz.
  -          </p>
  +                    Hamish can be contacted at hamish at catalyst.net.nz.
  +                </p>
                               </blockquote>
         </td></tr>
         <tr><td><br/></td></tr>
  @@ -357,71 +358,146 @@
         <tr><td>
           <blockquote>
                                       <ul>
  -          <p>
  -          <b>Hardware Environment</b><br />
  -          <li><i>Dedicated machine for indexing</i>: No, but nominal 
  -usage at time of indexing.</li>
  -          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
  -          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
  -          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
  -Array</li>
  -          </p>
  -          <p>
  -          <b>Software environment</b><br />
  -          <li><i>Java Version</i>: 1.3.1_06</li>
  -          <li><i>Java VM</i>: </li>
  -          <li><i>OS Version</i>: Winnt 4/Sp6</li>
  -          <li><i>Location of index</i>: local</li>
  -          </p>
  -          <p>
  -          <b>Lucene indexing variables</b><br />
  -          <li><i>Number of source documents</i>: about 60K</li>
  -          <li><i>Total filesize of source documents</i>: 6.5GB</li>
  -          <li><i>Average filesize of source documents</i>: 100K 
  -(6.5GB/60K documents)</li>
  -          <li><i>Source documents storage location</i>: filesystem on 
  -NTFS</li>
  -          <li><i>File type of source documents</i>: </li>
  -          <li><i>Parser(s) used, if any</i>: Currently the only parser 
  -used is the Quiotix html
  -          parser.</li>
  -          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
  -          <li><i>Number of fields per document</i>: 8</li>
  -          <li><i>Type of fields</i>: All strings, and all are stored 
  -and indexed.</li>
  -          <li><i>Index persistence</i>: FSDirectory</li>
  -          </p>
  -          <p>
  -          <b>Figures</b><br />
  -          <li><i>Time taken (in ms/s as an average of at least 3 
  -indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
  -minutes.  Note that the #
  -          and size of documents changes daily.</li>
  -          <li><i>Time taken / 1000 docs indexed</i>: </li>
  -          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
  -all.</li>
  -          </p>
  -          <p>
  -          <b>Notes</b><br />
  -          <li><i>Notes</i>: 
  -          <p>
  -          We have 10 threads reading files from the filesystem and 
  -parsing and
  -          analyzing them and the pushing them onto a queue and a single 
  -thread poping
  -          them from the queue and indexing.  Note that we are indexing 
  -email messages
  -          and are storing the entire plaintext in of the message in the 
  -index.  If the
  -          message contains attachment and we do not have a filter for 
  -the attachment
  -          (ie. we do not do PDFs yet), we discard the data.
  -          </p></li>
  -          </p>
  -          </ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br />
  +                        <li><i>Dedicated machine for indexing</i>: No, but nominal
  +                            usage at time of indexing.</li>
  +                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
  +                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
  +                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
  +                            Array</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br />
  +                        <li><i>Java Version</i>: 1.3.1_06</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Winnt 4/Sp6</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br />
  +                        <li><i>Number of source documents</i>: about 60K</li>
  +                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
  +                        <li><i>Average filesize of source documents</i>: 100K
  +                            (6.5GB/60K documents)</li>
  +                        <li><i>Source documents storage location</i>: filesystem on
  +                            NTFS</li>
  +                        <li><i>File type of source documents</i>: </li>
  +                        <li><i>Parser(s) used, if any</i>: Currently the only parser
  +                            used is the Quiotix html
  +                            parser.</li>
  +                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
  +                        <li><i>Number of fields per document</i>: 8</li>
  +                        <li><i>Type of fields</i>: All strings, and all are stored
  +                            and indexed.</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br />
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
  +                            minutes.  Note that the #
  +                            and size of documents changes daily.</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: </li>
  +                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
  +                            all.</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br />
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                We have 10 threads reading files from the filesystem and
  +                                parsing and
  +                                analyzing them and the pushing them onto a queue and a single
  +                                thread poping
  +                                them from the queue and indexing.  Note that we are indexing
  +                                email messages
  +                                and are storing the entire plaintext in of the message in the
  +                                index.  If the
  +                                message contains attachment and we do not have a filter for
  +                                the attachment
  +                                (ie. we do not do PDFs yet), we discard the data.
  +                            </p></li>
  +                    </p>
  +                </ul>
  +                                                <p>
  +                    Justin can be contacted at tvxh-lw4x at spamex.com.
  +                </p>
  +                            </blockquote>
  +      </td></tr>
  +      <tr><td><br/></td></tr>
  +    </table>
  +                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
  +      <tr><td bgcolor="#828DA6">
  +        <font color="#ffffff" face="arial,helvetica,sanserif">
  +          <a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
  +        </font>
  +      </td></tr>
  +      <tr><td>
  +        <blockquote>
  +                                    <p>
  +                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
  +                    nor was the total index built in one shot.  The index was created on several different
  +                    machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
  +                    1 million documents per batch.  Each of these small indexes was then moved to a
  +                    much larger drive, where they were all merged together into a big index.
  +                    This process was done manually, over the course of several months, as the sources became available.
  +                </p>
  +                                                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br />
  +                        <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
  +                            threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>
  +                        <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
  +                        <li><i>RAM</i>: 4 GB Memory</li>
  +                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br />
  +                        <li><i>Java Version</i>: 1.3.1</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br />
  +                        <li><i>Number of source documents</i>: 13,820,517</li>
  +                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
  +                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
  +                        <li><i>Source documents storage location</i>: Filesystem</li>
  +                        <li><i>File type of source documents</i>: XML</li>
  +                        <li><i>Parser(s) used, if any</i>: </li>
  +                        <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
  +                        <li><i>Number of fields per document</i>: 1 - 31</li>
  +                        <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                        <li><i>Index size</i>: 12.5 GB</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br />
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
  +                        <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
  +                            1 GB of memory was allotted to the indexer</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br />
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                The source documents were XML.  The "indexer" opened each document one at a time, ran an
  +                                XSL transformation on them, and then proceeded to index the stream.  The indexer optimized
  +                                the index every 50,000 documents (on this run) though previously, we optimized every
  +                                300,000 documents.  The performance didn't change much either way.  We did no other
  +                                tuning (RAM Directories, separate process to pretransform the source material, etc)
  +                                to make it index faster.  When all of these individual indexes were built, they were
  +                                merged together into the main index.  That process usually took ~ a day.
  +                            </p></li>
  +                    </p>
  +                </ul>
                                                   <p>
  -          Justin can be contacted at tvxh-lw4x at spamex.com.
  -          </p>
  +                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
  +                </p>
                               </blockquote>
         </td></tr>
         <tr><td><br/></td></tr>
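
The figures in the benchmark entry above can be cross-checked with a little arithmetic. The sketch below recomputes the per-1000-document rate and the run length in days from the two quoted numbers. The final projection to the full 13,820,517-document collection is an assumption for illustration only: the source does not say the rate stayed constant or that a single indexer processed everything, and it excludes the roughly one-day merge step.

```python
# Figures quoted in the benchmark entry above.
docs_indexed = 617_271       # documents in the timed run
elapsed_seconds = 209_698    # time reported for that run
total_docs = 13_820_517      # size of the whole source collection

# Rate per 1000 documents, and the run length expressed in days.
seconds_per_1000 = elapsed_seconds / docs_indexed * 1000
elapsed_days = elapsed_seconds / 86_400

# Assumption (not stated in the source): the same rate holds for the
# whole collection on one indexer, with no merge time included.
projected_days = total_docs / 1000 * seconds_per_1000 / 86_400

print(f"{seconds_per_1000:.0f} s per 1000 docs")
print(f"{elapsed_days:.2f} days for the timed run")
print(f"{projected_days:.1f} days projected for the full collection")
```

The first two values agree with the submitted "340 Seconds" and "~2.5 days"; the projection is only a rough scale estimate.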
  
  
  
  1.17      +1 -0      jakarta-lucene/docs/contributions.html
  
  Index: contributions.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/contributions.html,v
  retrieving revision 1.16
  retrieving revision 1.17
  diff -u -r1.16 -r1.17
  --- contributions.html        4 Dec 2002 05:56:32 -0000       1.16
  +++ contributions.html        12 Dec 2002 06:23:47 -0000      1.17
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.13      +1 -0      jakarta-lucene/docs/demo.html
  
  Index: demo.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/demo.html,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- demo.html 4 Dec 2002 05:56:32 -0000       1.12
  +++ demo.html 12 Dec 2002 06:23:47 -0000      1.13
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.13      +1 -0      jakarta-lucene/docs/demo2.html
  
  Index: demo2.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/demo2.html,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- demo2.html        4 Dec 2002 05:56:32 -0000       1.12
  +++ demo2.html        12 Dec 2002 06:23:47 -0000      1.13
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.15      +1 -0      jakarta-lucene/docs/demo3.html
  
  Index: demo3.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/demo3.html,v
  retrieving revision 1.14
  retrieving revision 1.15
  diff -u -r1.14 -r1.15
  --- demo3.html        4 Dec 2002 05:56:32 -0000       1.14
  +++ demo3.html        12 Dec 2002 06:23:47 -0000      1.15
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.13      +1 -0      jakarta-lucene/docs/demo4.html
  
  Index: demo4.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/demo4.html,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- demo4.html        4 Dec 2002 05:56:32 -0000       1.12
  +++ demo4.html        12 Dec 2002 06:23:47 -0000      1.13
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.6       +1 -0      jakarta-lucene/docs/fileformats.html
  
  Index: fileformats.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v
  retrieving revision 1.5
  retrieving revision 1.6
  diff -u -r1.5 -r1.6
  --- fileformats.html  4 Dec 2002 05:56:32 -0000       1.5
  +++ fileformats.html  12 Dec 2002 06:23:47 -0000      1.6
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.13      +1 -0      jakarta-lucene/docs/gettingstarted.html
  
  Index: gettingstarted.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/gettingstarted.html,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- gettingstarted.html       4 Dec 2002 05:56:32 -0000       1.12
  +++ gettingstarted.html       12 Dec 2002 06:23:47 -0000      1.13
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.24      +1 -0      jakarta-lucene/docs/index.html
  
  Index: index.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/index.html,v
  retrieving revision 1.23
  retrieving revision 1.24
  diff -u -r1.23 -r1.24
  --- index.html        4 Dec 2002 05:56:32 -0000       1.23
  +++ index.html        12 Dec 2002 06:23:47 -0000      1.24
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.14      +1 -0      jakarta-lucene/docs/luceneplan.html
  
  Index: luceneplan.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/luceneplan.html,v
  retrieving revision 1.13
  retrieving revision 1.14
  diff -u -r1.13 -r1.14
  --- luceneplan.html   4 Dec 2002 05:56:32 -0000       1.13
  +++ luceneplan.html   12 Dec 2002 06:23:47 -0000      1.14
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.22      +1 -0      jakarta-lucene/docs/powered.html
  
  Index: powered.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/powered.html,v
  retrieving revision 1.21
  retrieving revision 1.22
  diff -u -r1.21 -r1.22
  --- powered.html      4 Dec 2002 05:56:32 -0000       1.21
  +++ powered.html      12 Dec 2002 06:23:47 -0000      1.22
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.12      +1 -0      jakarta-lucene/docs/queryparsersyntax.html
  
  Index: queryparsersyntax.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/queryparsersyntax.html,v
  retrieving revision 1.11
  retrieving revision 1.12
  diff -u -r1.11 -r1.12
  --- queryparsersyntax.html    4 Dec 2002 05:56:32 -0000       1.11
  +++ queryparsersyntax.html    12 Dec 2002 06:23:47 -0000      1.12
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.20      +1 -0      jakarta-lucene/docs/resources.html
  
  Index: resources.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/resources.html,v
  retrieving revision 1.19
  retrieving revision 1.20
  diff -u -r1.19 -r1.20
  --- resources.html    4 Dec 2002 05:56:32 -0000       1.19
  +++ resources.html    12 Dec 2002 06:23:47 -0000      1.20
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.4       +1 -0      jakarta-lucene/docs/todo.html
  
  Index: todo.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/todo.html,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -r1.3 -r1.4
  --- todo.html 4 Dec 2002 05:56:32 -0000       1.3
  +++ todo.html 12 Dec 2002 06:23:47 -0000      1.4
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.20      +1 -0      jakarta-lucene/docs/whoweare.html
  
  Index: whoweare.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v
  retrieving revision 1.19
  retrieving revision 1.20
  diff -u -r1.19 -r1.20
  --- whoweare.html     4 Dec 2002 05:56:32 -0000       1.19
  +++ whoweare.html     12 Dec 2002 06:23:47 -0000      1.20
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.8       +1 -0      jakarta-lucene/docs/lucene-sandbox/index.html
  
  Index: index.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/index.html,v
  retrieving revision 1.7
  retrieving revision 1.8
  diff -u -r1.7 -r1.8
  --- index.html        4 Dec 2002 05:56:33 -0000       1.7
  +++ index.html        12 Dec 2002 06:23:48 -0000      1.8
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.7       +1 -0      jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html
  
  Index: tutorial.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html,v
  retrieving revision 1.6
  retrieving revision 1.7
  diff -u -r1.6 -r1.7
  --- tutorial.html     4 Dec 2002 05:56:33 -0000       1.6
  +++ tutorial.html     12 Dec 2002 06:23:48 -0000      1.7
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.6       +1 -0      jakarta-lucene/docs/lucene-sandbox/larm/overview.html
  
  Index: overview.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/larm/overview.html,v
  retrieving revision 1.5
  retrieving revision 1.6
  diff -u -r1.5 -r1.6
  --- overview.html     4 Dec 2002 05:56:33 -0000       1.5
  +++ overview.html     12 Dec 2002 06:23:48 -0000      1.6
  @@ -5,6 +5,7 @@
           
   <!-- start the processing -->
       <!-- ====================================================================== -->
  +    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
       <!-- Main Page Section -->
       <!-- ====================================================================== -->
       <html>
  
  
  
  1.2       +337 -271  jakarta-lucene/xdocs/benchmarks.xml
  
  Index: benchmarks.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/benchmarks.xml,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- benchmarks.xml    4 Dec 2002 05:46:43 -0000       1.1
  +++ benchmarks.xml    12 Dec 2002 06:23:48 -0000      1.2
  @@ -1,283 +1,349 @@
   <?xml version="1.0"?>
   <document>
       <properties>
  -      <author email="[EMAIL PROTECTED]">Kelvin Tan</author>
  -      <title>Resources - Performance Benchmarks</title>
  +        <author email="[EMAIL PROTECTED]">Kelvin Tan</author>
  +        <title>Resources - Performance Benchmarks</title>
       </properties>
       <body>
   
  -      <section name="Performance Benchmarks">
  -      <p>
  -      The purpose of these user-submitted performance figures is to 
  -give current and potential users of Lucene a sense 
  -      of how well Lucene scales. If the requirements for an upcoming 
  -project is similar to an existing benchmark, you 
  -      will also have something to work with when designing the system 
  -architecture for the application.
  -      </p>
  -      <p>
  -      If you've conducted performance tests with Lucene, we'd 
  -appreciate if you can submit these figures for display 
  -      on this page. Post these figures to the lucene-user mailing list 
  -using this 
  -      <a href="benchmarktemplate.xml">template</a>.
  -      </p>
  -      </section>
  -      
  -      <section name="Benchmark Variables">
  -      <p>
  -      <ul>
  -      <p>
  -      <b>Hardware Environment</b><br/>
  -      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
  -(yes/no)</li>
  -      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
  -      <li><i>RAM</i>: Self-explanatory</li>
  -      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
  -RAID-1, RAID-5)</li>
  -      </p>
  -      <p>
  -      <b>Software environment</b><br/>
  -      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
  -</li>
  -      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
  -      <li><i>OS Version</i>: Self-explanatory</li>
  -      <li><i>Location of index</i>: Is the index stored in filesystem 
  -or database? Is it on the same server(local) or 
  -      over the network?</li>
  -      </p>
  -      <p>
  -      <b>Lucene indexing variables</b><br/>
  -      <li><i>Number of source documents</i>: Number of documents being 
  -indexed</li>
  -      <li><i>Total filesize of source documents</i>: 
  -Self-explanatory</li>
  -      <li><i>Average filesize of source documents</i>: 
  -Self-explanatory</li>
  -      <li><i>Source documents storage location</i>: Where are the 
  -documents being indexed located? 
  -        Filesystem, DB, http,etc</li>
  -      <li><i>File type of source documents</i>: Types of files being 
  -indexed, e.g. HTML files, XML files, PDF files, etc.</li>
  -      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
  -various files for indexing, 
  -        e.g. XML parser, HTML parser, etc.</li>
  -      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
  -      <li><i>Number of fields per document</i>: Number of Fields each 
  -Document contains</li>
  -      <li><i>Type of fields</i>: Type of each field</li>
  -      <li><i>Index persistence</i>: Where the index is stored, e.g. 
  -FSDirectory, SqlDirectory, etc</li>
  -      </p>
  -      <p>
  -      <b>Figures</b><br/>
  -      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
  -runs)</i>: Time taken to index all files</li>
  -      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
  -1000 files</li>
  -      <li><i>Memory consumption</i>: Self-explanatory</li>
  -      </p>
  -      <p>
  -      <b>Notes</b><br/>
  -      <li><i>Notes</i>: Any comments which don't belong in the above, 
  -special tuning/strategies, etc</li>
  -      </p>
  -      </ul>
  -      </p>
  -      </section>
  +        <section name="Performance Benchmarks">
  +            <p>
  +                The purpose of these user-submitted performance figures is to
  +                give current and potential users of Lucene a sense
  +                of how well Lucene scales. If the requirements for an upcoming
  +                project is similar to an existing benchmark, you
  +                will also have something to work with when designing the system
  +                architecture for the application.
  +            </p>
  +            <p>
  +                If you've conducted performance tests with Lucene, we'd
  +                appreciate if you can submit these figures for display
  +                on this page. Post these figures to the lucene-user mailing list
  +                using this
  +                <a href="benchmarktemplate.xml">template</a>.
  +            </p>
  +        </section>
   
  -      <section name="User-submitted Benchmarks">
  -      <p>
  -      These benchmarks have been kindly submitted by Lucene users for 
  -reference purposes. 
  -      </p>
  -      <p><b>We make NO guarantees regarding their accuracy or 
  -validity.</b>
  -      </p>
  -      <p>We strongly recommend you conduct your own 
  -      performance benchmarks before deciding on a particular 
  -hardware/software setup (and hopefully submit 
  -      these figures to us).
  -      </p>
  -      
  -        <subsection name="Hamish Carpenter's benchmarks">
  -          <ul>
  -          <p>
  -          <b>Hardware Environment</b><br/>
  -          <li><i>Dedicated machine for indexing</i>: yes</li>
  -          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
  -          <li><i>RAM</i>: 512 DDR</li>
  -          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
  -          </p>
  -          <p>
  -          <b>Software environment</b><br/>
  -          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
  -          <li><i>Java VM</i>: </li>
  -          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
  -          <li><i>Location of index</i>: local</li>
  -          </p>
  -          <p>
  -          <b>Lucene indexing variables</b><br/>
  -          <li><i>Number of source documents</i>: Random generator. Set 
  -to make 1M documents
  -in 2x500,000 batches.</li>
  -          <li><i>Total filesize of source documents</i>: > 1GB if 
  -stored</li>
  -          <li><i>Average filesize of source documents</i>: 1KB</li>
  -          <li><i>Source documents storage location</i>: Filesystem</li>
  -          <li><i>File type of source documents</i>: Generated</li>
  -          <li><i>Parser(s) used, if any</i>: </li>
  -          <li><i>Analyzer(s) used</i>: Default</li>
  -          <li><i>Number of fields per document</i>: 11</li>
  -          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
  -          <li><i>Index persistence</i>: FSDirectory</li>
  -          </p>
  -          <p>
  -          <b>Figures</b><br/>
  -          <li><i>Time taken (in ms/s as an average of at least 3 
  -indexing runs)</i>: </li>
  -          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
  -          <li><i>Memory consumption</i>:</li>
  -          </p>
  -          <p>
  -          <b>Notes</b><br/>
  -          <li><i>Notes</i>: 
  -          <p>
  -          A windows client ran a random document generator which 
  -created
  -          documents based on some arrays of values and an excerpt 
  -(approx 1kb)
  -          from a text file of the bible (King James version).<br/>
  -          These were submitted via a socket connection (open throughout
  -          indexing process).<br/>
  -          The index writer was not closed between index calls.<br/>
  -          This created a 400Mb index in 23 files (after 
  -optimization).<br/>
  -          </p>
  -          <p>
  -          <u>Query details</u>:<br/>
  -          </p>
  -          <p>
  -          Set up a threaded class to start x number of simultaneous 
  -threads to
  -          search the above created index.
  -          </p>
  -          <p>
  -          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
  -(Teaser:goo* Tea
  -          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
  -          +DisplayStartDate:[mkwsw2jk0
  -          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
  -          </p>
  -          <p>
  -          This query counted 34000 documents and I limited the returned 
  -documents
  -          to 5.
  -          </p>
  -          <p>
  -          This is using Peter Halacsy's IndexSearcherCache slightly 
  -modified to
  -          be a singleton returned cached searchers for a given 
  -directory. This
  -          solved an initial problem with too many files open and 
  -running out of
  -          linux handles for them.
  -          </p>
  -          <pre>
  -          Threads|Avg Time per query (ms)
  -          1       1009ms
  -          2       2043ms
  -          3       3087ms
  -          4       4045ms
  -          ..        .
  -          ..        .
  -          10      10091ms
  -          </pre>
  -          <p>
  -          I removed the two date range terms from the query and it made 
  -a HUGE
  -          difference in performance. With 4 threads the avg time 
  -dropped to 900ms!
  -          </p>
  -          <p>Other query optimizations made little difference.</p></li>
  -          </p>
  -          </ul>
  -          <p>
  -          Hamish can be contacted at hamish at catalyst.net.nz.
  -          </p>
  -        </subsection>     
  +        <section name="Benchmark Variables">
  +            <p>
  +                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br/>
  +                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
  +                            (yes/no)</li>
  +                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
  +                        <li><i>RAM</i>: Self-explanatory</li>
  +                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
  +                            RAID-1, RAID-5)</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br/>
  +                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
  +                        </li>
  +                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
  +                        <li><i>OS Version</i>: Self-explanatory</li>
  +                        <li><i>Location of index</i>: Is the index stored in filesystem
  +                            or database? Is it on the same server(local) or
  +                            over the network?</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br/>
  +                        <li><i>Number of source documents</i>: Number of documents being
  +                            indexed</li>
  +                        <li><i>Total filesize of source documents</i>:
  +                            Self-explanatory</li>
  +                        <li><i>Average filesize of source documents</i>:
  +                            Self-explanatory</li>
  +                        <li><i>Source documents storage location</i>: Where are the
  +                            documents being indexed located?
  +                            Filesystem, DB, http,etc</li>
  +                        <li><i>File type of source documents</i>: Types of files being
  +                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
  +                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
  +                            various files for indexing,
  +                            e.g. XML parser, HTML parser, etc.</li>
  +                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
  +                        <li><i>Number of fields per document</i>: Number of Fields each
  +                            Document contains</li>
  +                        <li><i>Type of fields</i>: Type of each field</li>
  +                        <li><i>Index persistence</i>: Where the index is stored, e.g.
  +                            FSDirectory, SqlDirectory, etc</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br/>
  +                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
  +                                runs)</i>: Time taken to index all files</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
  +                            1000 files</li>
  +                        <li><i>Memory consumption</i>: Self-explanatory</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br/>
  +                        <li><i>Notes</i>: Any comments which don't belong in the above,
  +                            special tuning/strategies, etc</li>
  +                    </p>
  +                </ul>
  +            </p>
  +        </section>
   
  -        <subsection name="Justin Greene's benchmarks">
  -          <ul>
  -          <p>
  -          <b>Hardware Environment</b><br/>
  -          <li><i>Dedicated machine for indexing</i>: No, but nominal 
  -usage at time of indexing.</li>
  -          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
  -          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
  -          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
  -Array</li>
  -          </p>
  -          <p>
  -          <b>Software environment</b><br/>
  -          <li><i>Java Version</i>: 1.3.1_06</li>
  -          <li><i>Java VM</i>: </li>
  -          <li><i>OS Version</i>: Winnt 4/Sp6</li>
  -          <li><i>Location of index</i>: local</li>
  -          </p>
  -          <p>
  -          <b>Lucene indexing variables</b><br/>
  -          <li><i>Number of source documents</i>: about 60K</li>
  -          <li><i>Total filesize of source documents</i>: 6.5GB</li>
  -          <li><i>Average filesize of source documents</i>: 100K 
  -(6.5GB/60K documents)</li>
  -          <li><i>Source documents storage location</i>: filesystem on 
  -NTFS</li>
  -          <li><i>File type of source documents</i>: </li>
  -          <li><i>Parser(s) used, if any</i>: Currently the only parser 
  -used is the Quiotix html
  -          parser.</li>
  -          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
  -          <li><i>Number of fields per document</i>: 8</li>
  -          <li><i>Type of fields</i>: All strings, and all are stored 
  -and indexed.</li>
  -          <li><i>Index persistence</i>: FSDirectory</li>
  -          </p>
  -          <p>
  -          <b>Figures</b><br/>
  -          <li><i>Time taken (in ms/s as an average of at least 3 
  -indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
  -minutes.  Note that the #
  -          and size of documents changes daily.</li>
  -          <li><i>Time taken / 1000 docs indexed</i>: </li>
  -          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
  -all.</li>
  -          </p>
  -          <p>
  -          <b>Notes</b><br/>
  -          <li><i>Notes</i>: 
  -          <p>
  -          We have 10 threads reading files from the filesystem and 
  -parsing and
  -          analyzing them and the pushing them onto a queue and a single 
  -thread poping
  -          them from the queue and indexing.  Note that we are indexing 
  -email messages
  -          and are storing the entire plaintext in of the message in the 
  -index.  If the
  -          message contains attachment and we do not have a filter for 
  -the attachment
  -          (ie. we do not do PDFs yet), we discard the data.
  -          </p></li>
  -          </p>
  -          </ul>
  -          <p>
  -          Justin can be contacted at tvxh-lw4x at spamex.com.
  -          </p>
  -        </subsection> 
  +        <section name="User-submitted Benchmarks">
  +            <p>
  +                These benchmarks have been kindly submitted by Lucene users for
  +                reference purposes.
  +            </p>
  +            <p><b>We make NO guarantees regarding their accuracy or
  +                    validity.</b>
  +            </p>
  +            <p>We strongly recommend you conduct your own
  +                performance benchmarks before deciding on a particular
  +                hardware/software setup (and hopefully submit
  +                these figures to us).
  +            </p>
   
  -      </section>
  +            <subsection name="Hamish Carpenter's benchmarks">
  +                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br/>
  +                        <li><i>Dedicated machine for indexing</i>: yes</li>
  +                        <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
  +                        <li><i>RAM</i>: 512 DDR</li>
  +                        <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br/>
  +                        <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br/>
  +                        <li><i>Number of source documents</i>: Random generator. Set
  +                            to make 1M documents
  +                            in 2x500,000 batches.</li>
  +                        <li><i>Total filesize of source documents</i>: > 1GB if
  +                            stored</li>
  +                        <li><i>Average filesize of source documents</i>: 1KB</li>
  +                        <li><i>Source documents storage location</i>: Filesystem</li>
  +                        <li><i>File type of source documents</i>: Generated</li>
  +                        <li><i>Parser(s) used, if any</i>: </li>
  +                        <li><i>Analyzer(s) used</i>: Default</li>
  +                        <li><i>Number of fields per document</i>: 11</li>
  +                        <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br/>
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: </li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
  +                        <li><i>Memory consumption</i>:</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br/>
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                A Windows client ran a random document generator which
  +                                created
  +                                documents based on some arrays of values and an excerpt
  +                                (approx. 1 KB)
  +                                from a text file of the Bible (King James version).<br/>
  +                                These were submitted via a socket connection (open throughout
  +                                the indexing process).<br/>
  +                                The index writer was not closed between index calls.<br/>
  +                                This created a 400 MB index in 23 files (after
  +                                optimization).<br/>
  +                            </p>
  +                            <p>
  +                                <u>Query details</u>:<br/>
  +                            </p>
  +                            <p>
  +                                I set up a threaded class to start x simultaneous
  +                                threads to
  +                                search the index created above.
  +                            </p>
  +                            <p>
  +                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
  +                                (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
  +                                +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
  +                                +EndDate:[mq3dj1uq0-ntlxuggw0]
  +                            </p>
  +                            <p>
  +                                This query matched 34,000 documents, and I limited the returned
  +                                documents
  +                                to 5.
  +                            </p>
  +                            <p>
  +                                This uses Peter Halacsy's IndexSearcherCache, slightly
  +                                modified to
  +                                be a singleton that returns cached searchers for a given
  +                                directory. This
  +                                solved an initial problem with too many open files and
  +                                running out of
  +                                Linux file handles for them.
  +                            </p>
  +                            <pre>
  +                                Threads|Avg Time per query (ms)
  +                                1       1009ms
  +                                2       2043ms
  +                                3       3087ms
  +                                4       4045ms
  +                                ..        .
  +                                ..        .
  +                                10      10091ms
  +                            </pre>
  +                            <p>
  +                                I removed the two date range terms from the query and it made
  +                                a HUGE
  +                                difference in performance. With 4 threads the avg time
  +                                dropped to 900ms!
  +                            </p>
  +                            <p>Other query optimizations made little difference.</p></li>
  +                    </p>
  +                </ul>
  +                <p>
  +                    Hamish can be contacted at hamish at catalyst.net.nz.
  +                </p>
  +            </subsection>
  +
  +            <subsection name="Justin Greene's benchmarks">
  +                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br/>
  +                        <li><i>Dedicated machine for indexing</i>: No, but nominal
  +                            usage at time of indexing.</li>
  +                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
  +                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
  +                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
  +                            Array</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br/>
  +                        <li><i>Java Version</i>: 1.3.1_06</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Windows NT 4 / SP6</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br/>
  +                        <li><i>Number of source documents</i>: about 60K</li>
  +                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
  +                        <li><i>Average filesize of source documents</i>: 100K
  +                            (6.5GB/60K documents)</li>
  +                        <li><i>Source documents storage location</i>: filesystem on
  +                            NTFS</li>
  +                        <li><i>File type of source documents</i>: </li>
  +                        <li><i>Parser(s) used, if any</i>: Currently the only parser
  +                            used is the Quiotix html
  +                            parser.</li>
  +                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
  +                        <li><i>Number of fields per document</i>: 8</li>
  +                        <li><i>Type of fields</i>: All strings, and all are stored
  +                            and indexed.</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br/>
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
  +                            minutes.  Note that the number
  +                            and size of documents change daily.</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: </li>
  +                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
  +                            all.</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br/>
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                We have 10 threads reading files from the filesystem,
  +                                parsing and
  +                                analyzing them, and then pushing them onto a queue, and a single
  +                                thread popping
  +                                them from the queue and indexing.  Note that we are indexing
  +                                email messages
  +                                and are storing the entire plaintext of the message in the
  +                                index.  If the
  +                                message contains an attachment and we do not have a filter for
  +                                the attachment
  +                                (i.e. we do not do PDFs yet), we discard the data.
  +                            </p></li>
  +                    </p>
  +                </ul>
  +                <p>
  +                    Justin can be contacted at tvxh-lw4x at spamex.com.
  +                </p>
  +            </subsection>
  +
  +
  +            <subsection name="Daniel Armbrust's benchmarks">
  +                <p>
  +                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
  +                    nor was the total index built in one shot.  The index was created on several different
  +                    machines (all with these specs, or very similar), with each machine indexing batches of
  +                    500,000 to 1 million documents.  Each of these small indexes was then moved to a
  +                    much larger drive, where they were all merged together into a big index.
  +                    This process was done manually, over the course of several months, as the sources became available.
  +                </p>
  +                <ul>
  +                    <p>
  +                        <b>Hardware Environment</b><br/>
  +                        <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
  +                            threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>
  +                        <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
  +                        <li><i>RAM</i>: 4 GB Memory</li>
  +                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
  +                    </p>
  +                    <p>
  +                        <b>Software environment</b><br/>
  +                        <li><i>Java Version</i>: 1.3.1</li>
  +                        <li><i>Java VM</i>: </li>
  +                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
  +                        <li><i>Location of index</i>: local</li>
  +                    </p>
  +                    <p>
  +                        <b>Lucene indexing variables</b><br/>
  +                        <li><i>Number of source documents</i>: 13,820,517</li>
  +                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
  +                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
  +                        <li><i>Source documents storage location</i>: Filesystem</li>
  +                        <li><i>File type of source documents</i>: XML</li>
  +                        <li><i>Parser(s) used, if any</i>: </li>
  +                        <li><i>Analyzer(s) used</i>: A home-grown analyzer that simply removes stopwords.</li>
  +                        <li><i>Number of fields per document</i>: 1 - 31</li>
  +                        <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
  +                        <li><i>Index persistence</i>: FSDirectory</li>
  +                        <li><i>Index size</i>: 12.5 GB</li>
  +                    </p>
  +                    <p>
  +                        <b>Figures</b><br/>
  +                        <li><i>Time taken (in ms/s as an average of at least 3
  +                                indexing runs)</i>: For 617,271 documents, 209,698 seconds (or ~2.5 days)</li>
  +                        <li><i>Time taken / 1000 docs indexed</i>: 340 seconds</li>
  +                        <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
  +                            1 GB of memory was allotted to the indexer</li>
  +                    </p>
  +                    <p>
  +                        <b>Notes</b><br/>
  +                        <li><i>Notes</i>:
  +                            <p>
  +                                The source documents were XML.  The "indexer" opened each document one at a time, ran an
  +                                XSL transformation on it, and then proceeded to index the stream.  The indexer optimized
  +                                the index every 50,000 documents (on this run) though previously, we optimized every
  +                                300,000 documents.  The performance didn't change much either way.  We did no other
  +                                tuning (RAM Directories, separate process to pretransform the source material, etc.)
  +                                to make it index faster.  When all of these individual indexes were built, they were
  +                                merged together into the main index.  That process usually took about a day.
  +                            </p></li>
  +                    </p>
  +                </ul>
  +                <p>
  +                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
  +                </p>
  +            </subsection>
  +
  +        </section>
   
       </body>
   </document>
  -
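
Justin Greene's notes in the diff above describe several threads that read, parse and analyze files and push the results onto a queue, with a single thread popping from the queue and indexing. A minimal sketch of that producer/consumer shape follows. It is an illustration only: the class name, the END_OF_INPUT marker, the queue capacity and the string "parsing" are all assumptions, no Lucene classes are used, and it relies on java.util.concurrent, which did not exist in the Java 1.3.1 runtime listed in the benchmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class IndexingPipelineSketch {
    // Unique end-of-input marker, compared by identity (hypothetical helper).
    private static final String END_OF_INPUT = new String("END");

    /**
     * "Parses" each input on a pool of parserThreads threads, hands the result
     * to a bounded queue, and lets one consumer thread "index" it.
     * The order of the returned list is not deterministic.
     */
    static List<String> run(List<String> inputs, int parserThreads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        List<String> indexed = new ArrayList<>();

        // Single consumer: stands in for the one thread that does the indexing.
        Thread indexer = new Thread(() -> {
            try {
                for (String doc = queue.take(); doc != END_OF_INPUT; doc = queue.take()) {
                    indexed.add(doc); // placeholder for adding the document to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        // Producers: stand in for the threads that read, parse and analyze files.
        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        for (String raw : inputs) {
            parsers.submit(() -> {
                try {
                    queue.put(raw.trim().toLowerCase(Locale.ROOT)); // placeholder "parse"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            parsers.shutdown();
            if (!parsers.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("parsers did not finish in time");
            }
            queue.put(END_OF_INPUT); // all producers are done, so stop the consumer
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> out = run(List.of(" Hello ", "WORLD", " Lucene"), 10);
        System.out.println(out.size() + " documents indexed");
    }
}
```

The bounded queue is a design choice of this sketch: it keeps the parser threads from running far ahead of the single indexing thread and filling the heap, which may matter given the note that the JVM used all of its 256MB.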
  
  
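
The figures in the diff above can be cross-checked with a little arithmetic. The sketch below recomputes the "time taken / 1000 docs" value from Daniel Armbrust's run and derives an aggregate query rate from Hamish Carpenter's thread table. The class and method names are hypothetical, and the query rate assumes every thread issues queries back to back, which the notes do not state.

```java
class BenchmarkArithmetic {
    /** Seconds of indexing time per 1000 documents. */
    static double secondsPerThousandDocs(long docs, double totalSeconds) {
        return totalSeconds / docs * 1000.0;
    }

    /** Aggregate queries per second when `threads` queries run concurrently (assumed model). */
    static double queriesPerSecond(int threads, double avgMillisPerQuery) {
        return threads * 1000.0 / avgMillisPerQuery;
    }

    public static void main(String[] args) {
        // Daniel Armbrust's run: 617271 documents in 209698 seconds.
        System.out.printf("%.1f s per 1000 docs%n", secondsPerThousandDocs(617_271L, 209_698.0));
        System.out.printf("%.2f days in total%n", 209_698.0 / 86_400.0);

        // Hamish Carpenter's table: thread count and average time per query in ms.
        int[][] rows = {{1, 1009}, {2, 2043}, {3, 3087}, {4, 4045}, {10, 10091}};
        for (int[] row : rows) {
            System.out.printf("%2d threads: %.2f queries/s%n",
                    row[0], queriesPerSecond(row[0], row[1]));
        }
    }
}
```

Under that assumption the aggregate rate stays close to one query per second at every thread count listed, which would be consistent with the queries being effectively serialized. That is a reading of the table, not a measured result.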
  
