Hello Guido:

On Wed, 02 May 2007, Guido Pelzer wrote:

>   Specially designed indexes to provide Google-like search speeds
>   for repositories of up to 1,500,000 records
>
> what happens over 1,500,000 records?  mysql says: The maximum
> effective table size for MySQL databases is usually determined by
> operating system constraints on file sizes, not by MySQL internal
> limits.

Technically speaking, nothing prevents Invenio from storing more
records than that, but the price to pay for doing so would be rather
slow indexing times.  The bibXXx storage split and the indexing
processes were designed in such a way as to make user-seen searching
fast at the expense of admin-seen indexing speed.  I estimate that the
current design remains reasonable, from the admin-seen indexing speed
point of view, for repositories of up to about 1,500,000 records.
(Note, however, that if you use more aggressive stemming (we don't),
then you can already comfortably handle more records than that.)
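To illustrate the trade-off in the abstract: a word index maps each
indexed term to the set of record IDs containing it, so a search is a
single lookup, while ingesting a record must touch the entry of every
word it contains.  The following toy sketch (my own illustration, not
Invenio's actual bibXXx/idxWORD code; all names are hypothetical)
shows the idea:

```python
# Toy inverted index illustrating the search-vs-indexing trade-off.
# Illustrative sketch only; Invenio's real indexer works on MySQL
# tables, not in-memory dictionaries.

class ToyWordIndex:
    def __init__(self):
        self.index = {}  # word -> set of record IDs

    def add_record(self, recid, text):
        # Indexing cost: every word of the record must be visited
        # and its index entry updated.
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(recid)

    def search(self, word):
        # Search cost: a single lookup, essentially independent of
        # the number of records in the repository.
        return self.index.get(word.lower(), set())

idx = ToyWordIndex()
idx.add_record(1, "Higgs boson search")
idx.add_record(2, "boson decay channels")
print(sorted(idx.search("boson")))  # [1, 2]
```

As the repository grows, the per-record update cost grows with it,
which is where the "comfort limit" for indexing speed comes from.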

Pushing up this "comfort limit" will require some changes to the
indexer.  We are actually looking at this very issue in connection
with our upgrade to MySQL 4.1 and a 64-bit OS, because our table
sizes multiplied greatly in the process, making indexing slower.
There are four possible solutions: (a) phase out MySQL 4.0 support
and handle column types and character sets "properly"; (b) store
indexes in a more economically viable structure (NUMERIC is not good
at storing bit vectors); (c) abolish the bibXXx architecture in
favour of idxPHRASE; (d) introduce a parallel storage architecture to
store parts of the data and indexes on separate (and possibly
replicated) nodes.
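To give a feel for option (b): a set of record IDs packed as a bit
vector (one bit per possible record ID) is far more compact than a
serialized list of integers.  Here is a minimal sketch of the packing
idea (my own illustration with hypothetical helper names, not the
structure we will necessarily adopt; the sizes are indicative only):

```python
# Hedged sketch of option (b): storing a set of record IDs as a
# packed bit vector rather than, say, a serialized list of integers.

def as_bit_vector(recids, max_recid):
    """Pack record IDs into bytes, one bit per possible record ID."""
    buf = bytearray((max_recid // 8) + 1)
    for recid in recids:
        buf[recid // 8] |= 1 << (recid % 8)
    return bytes(buf)

def from_bit_vector(buf):
    """Recover the set of record IDs from a packed bit vector."""
    return {i * 8 + b for i, byte in enumerate(buf)
                      for b in range(8) if byte & (1 << b)}

recids = set(range(0, 1_500_000, 3))      # every third record ID
packed = as_bit_vector(recids, 1_500_000)
print(len(packed))                        # 187501 bytes for 500,000 IDs
assert from_bit_vector(packed) == recids
```

Half a million seven-digit IDs stored as comma-separated text would
take roughly 4 MB; the bit vector above takes under 200 kB, which is
the kind of saving a more economical index structure would buy.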

I'm actually going to look shortly at the first two options, because
we need them here at CERN rather urgently.  I expect this might
enable us to push the "comfort limit" beyond 1,500,000 records too.
The latter two options will be explored later.

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
