#773: data cacher: introduce more efficient timestamp verifications
-------------------------+----------------------
Reporter: simko | Owner: rajimene
Type: enhancement | Status: new
Priority: major | Milestone:
Component: MiscUtil | Version:
Keywords: |
-------------------------+----------------------
Several data cachers use `get_table_update_time()` to check whether data
has changed in the DB and hence whether they have to reload the cache.
`get_table_update_time()` is implemented, basically, via the `SHOW TABLE
STATUS` SQL command. Testing showed that such a timestamp verifier is
actually rather time-consuming and so may become problematic when an
Invenio installation is heavily bombarded by many concurrent user
requests.
For example, here is an old October 2010 comparison:
{{{
In [3]: %timeit x = run_sql("SHOW TABLE STATUS LIKE 'idxINDEX';")
1000 loops, best of 3: 1.8 ms per loop
In [4]: %timeit y = run_sql("SELECT MAX(last_updated) FROM idxINDEX;")
1000 loops, best of 3: 217 µs per loop
In [5]: %timeit z = run_sql("SELECT UPDATE_TIME FROM
INFORMATION_SCHEMA.TABLES WHERE table_name='idxINDEX';")
100 loops, best of 3: 2.08 ms per loop
}}}
It is therefore interesting to amend existing data cacher definitions to
replace the `get_table_update_time()` technique by simpler, faster SELECT
SQL statements, whenever possible.
1) Sometimes we may be able to use an existing timestamp column, such as
in the fictive example using `idxINDEX.last_updated` above.
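Technique 1 can be sketched as follows. This is a minimal illustration
using Python's sqlite3 as a stand-in for Invenio's `run_sql()`/MySQL;
the table and column names mirror the `idxINDEX.last_updated` example,
and `get_idx_update_time()` is a hypothetical helper name:

{{{
import sqlite3

# In-memory stand-in for the real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idxINDEX (id INTEGER, last_updated TEXT)")
conn.execute("INSERT INTO idxINDEX VALUES (1, '2010-10-01 12:00:00'),"
             " (2, '2010-10-02 09:30:00')")

def get_idx_update_time():
    """Cheap timestamp verifier: one SELECT MAX() on a timestamp
    column instead of SHOW TABLE STATUS."""
    return conn.execute(
        "SELECT MAX(last_updated) FROM idxINDEX").fetchone()[0]

cached_timestamp = get_idx_update_time()
# Later, before serving from the cache:
if get_idx_update_time() != cached_timestamp:
    pass  # timestamps differ -> reload the cache
}}}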
2) When there is none, we can introduce new timestamp columns in existing
tables. Note that this could sometimes lead to inefficient repetition of
column values, e.g. if we were to add `last_updated` to all the rows of
the `collection` table, given that webcoll generates only a single
timestamp for all collections anyway.
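Technique 2, again sketched with sqlite3 standing in for MySQL: a one-off
migration adds the column, and the updating daemon (e.g. webcoll) stamps
it at the end of each run. The timestamp value is illustrative, and the
update-all-rows statement deliberately shows the repetition caveat from
the text:

{{{
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO collection VALUES (?, ?)",
                 [(1, "Articles"), (2, "Preprints")])

# One-off schema migration:
conn.execute("ALTER TABLE collection ADD COLUMN last_updated TEXT")

# What webcoll would do at the end of a run: a single timestamp,
# repeated on every row (the inefficiency noted above).
conn.execute("UPDATE collection SET last_updated = '2010-10-03 08:00:00'")

# The cheap verifier a cacher would then use:
ts = conn.execute(
    "SELECT MAX(last_updated) FROM collection").fetchone()[0]
}}}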
3) If it is not possible to obtain effective SQL queries either with the
current definitions or after the addition of columns as described in the
previous two techniques, for example because we would have to consult
many tables in order to detect the final timestamp value, then it will be
good to have a generic data cacher table that would hold timestamps for
the most performance-critical data cachers. The table may look like this:
{{{
datacacher
----------
id
name
definition
timestamp
}}}
The data cachers in the application would then run the following simple
SQL statement:
{{{
SELECT timestamp FROM datacacher WHERE name='collection_webpage';
}}}
in order to get the latest DB timestamp so as to verify whether the local
collection webpage cache is to be reloaded or not.
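The registry table and its single-row lookup can be sketched like this,
with sqlite3 standing in for MySQL; the `collection_webpage` name follows
the example above, `cache_is_stale()` is a hypothetical helper name, and
the definition string is illustrative filler:

{{{
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE datacacher (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    definition TEXT,
    timestamp TEXT)""")
conn.execute("INSERT INTO datacacher (name, definition, timestamp) "
             "VALUES ('collection_webpage', 'bibrec,bibfmt,...', "
             "'2010-10-04 10:15:00')")

def cache_is_stale(name, local_timestamp):
    """Compare the cacher's locally stored timestamp with the
    registry's latest DB timestamp."""
    row = conn.execute("SELECT timestamp FROM datacacher WHERE name=?",
                       (name,)).fetchone()
    return row is not None and row[0] > local_timestamp
}}}

Each verification is then a single lookup on a tiny unique-keyed table,
regardless of how many underlying tables feed the cacher.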
Note that this requires a new data cacher timestamp gatherer daemon to be
running in the background periodically. This daemon would be permanently
scanning data cacher definitions and filling datacacher.timestamp column
values accordingly. For example, the collection webpage data cacher
(currently used in webcoll only, but see Carmen's forthcoming branch) uses
the following tables to collect its timestamp:
{{{
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('bibrec'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('bibfmt'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('idxWORD%'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('collection%'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('portalbox'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('field%'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('format%'))
websearch_webcoll.py: database_tables_timestamps.append(get_table_update_time('rnkMETHODNAME'))
}}}
Therefore, in order to populate the datacacher table, statements like
these are to be executed in the background periodically.
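One pass of such a gatherer daemon could look like the following sketch.
sqlite3 again stands in for MySQL; `per_table_update_time()` is a
hypothetical stand-in for `get_table_update_time()`, and the table list
and timestamps are made-up illustration data:

{{{
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datacacher (name TEXT UNIQUE, timestamp TEXT)")
conn.execute("INSERT INTO datacacher VALUES ('collection_webpage', NULL)")

# Cacher definitions: which tables feed each cacher's timestamp.
DEFINITIONS = {
    'collection_webpage': ['bibrec', 'bibfmt', 'collection'],
}

# Canned per-table update times for illustration; the real daemon
# would probe the DB here.
FAKE_TABLE_TIMES = {
    'bibrec': '2010-10-02 11:00:00',
    'bibfmt': '2010-10-05 09:00:00',
    'collection': '2010-10-01 16:30:00',
}

def per_table_update_time(table):
    return FAKE_TABLE_TIMES[table]

def gather_once():
    """One daemon pass: for every registered cacher, take the newest
    per-table timestamp and store it in datacacher.timestamp."""
    for name, tables in DEFINITIONS.items():
        newest = max(per_table_update_time(t) for t in tables)
        conn.execute("UPDATE datacacher SET timestamp=? WHERE name=?",
                     (newest, name))

gather_once()
}}}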
4) Usually the situation is not that convoluted; most data cachers
consult only one or two tables, so for these technique 3 would not be
necessary; it should be sufficient to use techniques 1 and 2. However, a
generic store for any future data cachers would be nice to have anyway.
It is part of this ticket to verify how many cases fall under techniques
1 and 2 vs technique 3, in order to see the importance and the spread of
a generic technique like 3.
5) Note that the datacacher table would be ultra-small, holding mostly
the list of registered data cachers and their DB timestamps, together
with their definitions and anything needed for running the data cacher
timestamp gatherer daemon, but nothing much else. It would usually not
hold any other runtime data; these are typically stored elsewhere, say in
the citation tables in the case of the citation daemon.
(Just some (older) musings in writing before we re-discuss IRL.)
--
Ticket URL: <http://invenio-software.org/ticket/773>
Invenio <http://invenio-software.org>