Hello Guido:

On Wed, 02 May 2007, Guido Pelzer wrote:
> Specially designed indexes to provide Google-like search speeds
> for repositories of up to 1,500,000 records
>
> what happens over 1,500,000 records? mysql says: The maximum
> effective table size for MySQL databases is usually determined by
> operating system constraints on file sizes, not by MySQL internal
> limits.

Technically speaking, nothing prevents Invenio from storing more
records than that, but the price to pay for doing so would be rather
slow indexing times.  The bibXXx storage split and the indexing
processes were designed in such a way as to make user-seen searching
fast at the expense of admin-seen indexing speed.  (A toy
illustration of this trade-off is sketched in the PS below.)  I
estimate the current design is reasonable, from the admin-seen
indexing speed point of view, for repositories of up to about
1,500,000 records.  (However, note that if you use more aggressive
stemming than we do, then you can already comfortably handle more
records than that.)

To push this "comfort limit" up will require some changes to the
indexer.  We are actually looking at this very issue in connection
with upgrading to MySQL 4.1 and a 64-bit OS, because our table sizes
multiplied greatly in the process, making indexing slower.  There are
four possible solutions:

  (a) phase out MySQL 4.0 support and handle column types and
      character sets "properly";
  (b) store indexes in a more economically viable structure (Numeric
      is not good at storing bit vectors; see the PPS below);
  (c) abolish the bibXXx architecture in favour of idxPHRASE;
  (d) introduce a parallel storage architecture to store parts of the
      data and indexes on separate (and possibly replicated) nodes.

I'm actually going to look shortly at the former two options, because
we need this here at CERN rather urgently.  I assume this might
enable us to push the "comfort limit" beyond 1,500,000 records too.
The latter two options will be explored later.

Best regards
--
Tibor Simko  **  CERN Document Server  **  <http://cds.cern.ch/>
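PS: To make the searching-vs-indexing trade-off above concrete, here
is a minimal toy sketch in Python.  It is not our actual bibXXx code,
and all the names in it are made up for illustration only; it merely
shows why a term-to-hit-list index answers a query with one lookup
per term, while paying for it at indexing time, when every term of a
newly added record forces its stored hit list to be updated.

    # Toy inverted index: hypothetical names, illustrative only.
    index = {}  # term -> set of record IDs ("one row per term")

    def add_record(recid, text):
        # Indexing cost: every term occurring in the record forces an
        # update of that term's stored hit list (in a database: read
        # the row, deserialize, add the ID, reserialize, write back).
        for term in text.lower().split():
            index.setdefault(term, set()).add(recid)

    def search(query):
        # Search cost: one dictionary lookup per query term, followed
        # by an intersection of the resulting hit sets.
        hits = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    add_record(1, "higgs boson search")
    add_record(2, "boson decay channels")
    print(search("boson"))        # -> {1, 2}
    print(search("higgs boson"))  # -> {1}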

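PPS: Regarding option (b), a quick back-of-the-envelope run in Python
shows why a packed bit vector can be a much more economical way to
store a dense hit list than a list of serialized record IDs.  The
format and numbers below are purely illustrative and are not what
Invenio actually stores.

    import zlib

    NRECORDS = 1500000
    hits = set(range(0, NRECORDS, 3))  # a term matching every 3rd record

    # (1) hit list as serialized 32-bit record IDs: 4 bytes per hit
    as_ints = b"".join(r.to_bytes(4, "little") for r in sorted(hits))

    # (2) hit list as a bit vector: one bit per record in the repository
    bitmap = bytearray((NRECORDS + 7) // 8)
    for recid in hits:
        bitmap[recid >> 3] |= 1 << (recid & 7)

    print(len(as_ints))  # 2000000 bytes
    print(len(bitmap))   # 187500 bytes
    print(len(zlib.compress(bytes(bitmap))))  # far smaller still

The gain naturally depends on how dense the hit list is: for a very
rare term a plain ID list wins, so the choice of structure is a
trade-off rather than a universal answer.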