Hi,
Tibor Simko wrote:
1) One thing to keep in mind though is that some client code might
search in the bibfmt table in a non-composite-index-friendly way, which
would result in very slow queries if we delete the separate indexes
without offering an alternative. So we may perhaps introduce a primary
index as you described, plus an additional index for the left-out
column if it is used in stand-alone-column SQL queries. Otherwise we
can simply check every bibfmt SQL query to confirm that it uses the
composite index column order properly. See for example
<http://dev.mysql.com/doc/refman/4.1/en/multiple-column-indexes.html>
All the client code that I've seen always uses both columns, id_bibrec
and format. The migration kit I've developed keeps both individual
indexes. In my opinion, we could get rid of them, since in all cases the
composite key is preferred over the individual keys.
2) It may be interesting to check how big the DB table index files are
before and after the change. E.g. for the INSPIRE site, we currently
have:
2.5G bibfmt.MYD
65M bibfmt.MYI
which is perfectly reasonable. In the ADS case, the bibfmt table may be
huge, and if the MYI index file is huge too, it may be perhaps
interesting to investigate options like (i) splitting bibfmt table into
several bibfmt sub-tables, one for every format. Though increasing the
number of tables is not ideal either, so dunno whether this option is
worth it. (ii) If the MYD/MYI files are still not too huge, then we can
eventually think of not compressing the BLOB values, as we spoke about
in Boston with respect to the citation dictionary. Perhaps the
non-compressed size will remain reasonable, and we shall gain on the
compress/uncompress CPU time. I guess your MYD
file is huge already, especially if you cache recstruct too. But maybe
we can use several bibfmt tables that are uncompressed, so a mix of (i)
and (ii). Just thinking of taking this opportunity to do some more
bibfmt-related optimization studies and changes.
I do not currently have read access to the database directory, but
Giovanni will probably be able to come back here with stats on the
table/index growth.
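In the meantime, here is a quick sketch of how those numbers could be
gathered without filesystem access, by asking MySQL itself (for MyISAM
tables, Data_length and Index_length correspond roughly to the MYD and
MYI file sizes; run_sql() is the usual Invenio helper):

    # Sketch only: report bibfmt data/index sizes via SHOW TABLE STATUS
    # instead of looking at the MYD/MYI files directly.
    from invenio.dbquery import run_sql

    def report_bibfmt_size():
        status = run_sql("SHOW TABLE STATUS LIKE 'bibfmt'")[0]
        # Positions follow the MySQL manual: Rows=4, Data_length=6, Index_length=8.
        nb_rows, data_length, index_length = status[4], status[6], status[8]
        print "bibfmt: %s rows, %.1f MB of data, %.1f MB of index" % \
              (nb_rows, data_length / 1024.0 ** 2, index_length / 1024.0 ** 2)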
3) Also, the ALTER TABLE statements for the update-v0.99.1-tables target
should behave very safely in case there are dupes, e.g. telling user
what to do if ALTER TABLE fails (=remove conflicting rows, recheck
bibxxx values, and rerun bibreformat and/or XM re-generation). Hmm,
maybe you can just do it automatically, since we advise people to take
a backup before upgrading to v1.0 anyway, so maybe automatically fixing
all the vulnerable recIDs will do (while reporting them to the user).
If it is not practical to do this inside the update-v0.99.1-tables
Makefile target due to limited LOC space, then please write a little migration
kit script, see all the various *_migration_kit.py files we have been
providing.
I pushed my changes to GitHub:
http://github.com/badzil/My-Invenio/tree/bibformat-migration-kit
This includes a migration kit that provides three different actions, in
order to minimize the risk of the operation:
* simulation: does not modify anything; it only reports whether
duplicate rows exist.
* automatic: if duplicates are found, only the newest row is kept (see
the sketch after this list).
* reformat: if duplicates are found, all rows are deleted and the
records are reformatted from scratch.
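To make the "automatic" action concrete, here is a minimal sketch of the
keep-the-newest strategy it follows (the actual code lives in the branch
above and may differ in details; id_bibrec, format and last_updated are
the standard bibfmt columns):

    # Sketch only: for every (id_bibrec, format) pair stored more than once,
    # delete the older rows and keep the most recently updated one.
    from invenio.dbquery import run_sql

    def deduplicate_bibfmt():
        dupes = run_sql("""SELECT id_bibrec, format, COUNT(*)
                             FROM bibfmt
                            GROUP BY id_bibrec, format
                           HAVING COUNT(*) > 1""")
        for recid, of, count in dupes:
            run_sql("""DELETE FROM bibfmt
                        WHERE id_bibrec=%s AND format=%s
                        ORDER BY last_updated ASC
                        LIMIT %s""", (recid, of, int(count) - 1))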
I also replaced the raw SQL queries touching the bibfmt table in
websearch and bibupload with calls to the bibformat_dblayer API.
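For example, a direct bibfmt look-up in websearch now goes through the
DB layer instead; roughly like this (I am using get_preformatted_record()
as the illustrative entry point here, please check the branch for the
exact calls):

    # Before: raw SQL in the caller, e.g.
    #   run_sql("SELECT value FROM bibfmt WHERE id_bibrec=%s AND format=%s",
    #           (recid, of))
    # After: go through the bibformat DB layer instead.
    from invenio import bibformat_dblayer

    value = bibformat_dblayer.get_preformatted_record(recid, of)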
Benoit.