#773: data cacher: introduce more efficient timestamp verifications
-------------------------+----------------------
 Reporter:  simko        |      Owner:  rajimene
     Type:  enhancement  |     Status:  new
 Priority:  major        |  Milestone:
Component:  MiscUtil     |    Version:
 Keywords:               |
-------------------------+----------------------
 Several data cachers use `get_table_update_time()` in order to check
 whether data has changed in the DB and hence whether they have to reload
 the cache.  `get_table_update_time()` is implemented, basically, via the
 `SHOW TABLE STATUS` SQL command.  Testing showed that such a timestamp
 verifier is actually rather time-consuming and so may become problematic
 when an Invenio installation is heavily bombarded by many concurrent user
 requests.

 For example, here is an old October 2010 comparison:

 {{{
 In [3]: %timeit x = run_sql("SHOW TABLE STATUS LIKE 'idxINDEX';")
 1000 loops, best of 3: 1.8 ms per loop

 In [4]: %timeit y = run_sql("SELECT MAX(last_updated) FROM idxINDEX;")
 1000 loops, best of 3: 217 µs per loop

 In [5]: %timeit z = run_sql("SELECT UPDATE_TIME FROM
 INFORMATION_SCHEMA.TABLES WHERE table_name='idxINDEX';")
 100 loops, best of 3: 2.08 ms per loop
 }}}

 It would therefore be interesting to amend existing data cacher
 definitions to replace the `get_table_update_time()` technique with
 simpler, faster SELECT SQL statements, whenever possible.
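
 As an illustration of the faster-SELECT idea, here is a minimal sketch of
 such a timestamp check, using an in-memory SQLite table as a stand-in for
 the MySQL one (`run_sql` and the real schema are not used here; the
 column values are invented):

 ```python
 import sqlite3

 # In-memory SQLite stand-in for the MySQL table; the real code would
 # issue the same SELECT via run_sql().
 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE idxINDEX (id INTEGER, last_updated TEXT)")
 conn.executemany("INSERT INTO idxINDEX VALUES (?, ?)",
                  [(1, "2010-10-01 12:00:00"),
                   (2, "2010-10-02 08:30:00")])

 def get_index_update_time(conn):
     """Cheap timestamp check: MAX() over an existing timestamp column,
     instead of the costly SHOW TABLE STATUS."""
     return conn.execute(
         "SELECT MAX(last_updated) FROM idxINDEX").fetchone()[0]
 ```

 The caller simply compares the returned value against the timestamp it
 stored at the last cache reload.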

 1) Sometimes we may be able to use existing timestamp columns, such as
 in the fictive example using `idxINDEX.last_updated` above.

 2) When no such column exists, we can introduce new timestamp columns in
 existing tables.  Note that this could sometimes lead to inefficient
 repetition of column values, e.g. if we wanted to add `last_updated` to
 all the rows of the `collection` table, given that webcoll generates only
 a single timestamp for all collections anyway.
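
 Technique 2 could be sketched like this (again with SQLite as a stand-in;
 the table contents and timestamp values are invented), which also shows
 the repetition drawback mentioned above:

 ```python
 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT)")
 conn.executemany("INSERT INTO collection (name) VALUES (?)",
                  [("Articles",), ("Preprints",)])

 # Technique 2: retrofit a timestamp column onto an existing table.
 conn.execute("ALTER TABLE collection ADD COLUMN last_updated TEXT")

 # Drawback: webcoll produces one timestamp, so every row ends up
 # carrying the same repeated value.
 conn.execute("UPDATE collection SET last_updated = ?",
              ("2010-10-03 09:00:00",))

 ts = conn.execute(
     "SELECT MAX(last_updated) FROM collection").fetchone()[0]
 ```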

 3) If the definitions cannot be turned into effective SQL queries, either
 as they stand or after adding columns as described in the previous two
 techniques, for example because we would have to examine many tables in
 order to detect the final timestamp value, then it will be good to have a
 generic data cacher table that would hold timestamps for the most
 performance-critical data cachers.  The table may look like this:

 {{{
   datacacher
   ----------
   id
   name
   definition
   timestamp
 }}}
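
 In MySQL terms, a possible DDL for such a table might be (purely
 illustrative; the column names and types are guesses, not a final
 design):

 ```sql
 -- Hypothetical DDL; names/types are illustrative only.
 CREATE TABLE datacacher (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(255) NOT NULL,  -- e.g. 'collection_webpage'
   definition TEXT,             -- how to compute the timestamp
   timestamp DATETIME,          -- latest relevant DB change
   PRIMARY KEY (id),
   UNIQUE KEY (name)
 );
 ```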

 The data cachers in the application would then run the following simple
 SQL statement:

 {{{
    SELECT timestamp FROM datacacher WHERE name='collection_webpage';
 }}}

 in order to get the latest DB timestamp so as to verify whether the local
 collection webpage cache is to be reloaded or not.
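
 The consuming side could then look roughly like this (a minimal sketch
 with SQLite standing in for MySQL; the class and its methods are
 invented for illustration, not the actual Invenio data cacher API):

 ```python
 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE datacacher "
              "(id INTEGER PRIMARY KEY, name TEXT UNIQUE, "
              "definition TEXT, timestamp TEXT)")
 conn.execute("INSERT INTO datacacher (name, timestamp) VALUES (?, ?)",
              ("collection_webpage", "2010-10-03 10:00:00"))

 class DataCacher:
     """Minimal reload-on-stale sketch; real cachers would differ."""
     def __init__(self, conn, name):
         self.conn = conn
         self.name = name
         self.cache_timestamp = None
         self.cache = None

     def db_timestamp(self):
         """Cheap single-row lookup replacing SHOW TABLE STATUS."""
         row = self.conn.execute(
             "SELECT timestamp FROM datacacher WHERE name=?",
             (self.name,)).fetchone()
         return row[0] if row else None

     def get(self):
         """Reload the cache only when the DB timestamp moved."""
         ts = self.db_timestamp()
         if ts != self.cache_timestamp:
             self.cache = self.recreate_cache()
             self.cache_timestamp = ts
         return self.cache

     def recreate_cache(self):
         return "rebuilt cache content"  # placeholder for real work

 cacher = DataCacher(conn, "collection_webpage")
 ```

 Subsequent `get()` calls are then a single indexed SELECT until the
 gatherer bumps the stored timestamp.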

 Note that this requires a new data cacher timestamp gatherer daemon
 running periodically in the background.  This daemon would continuously
 scan data cacher definitions and fill the datacacher.timestamp column
 values accordingly.  For example, the collection webpage data cacher
 (currently used in webcoll only, but see Carmen's forthcoming branch) uses
 the following tables to collect its timestamp:

 {{{
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('bibrec'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('bibfmt'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('idxWORD%'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('collection%'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('portalbox'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('field%'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('format%'))
 websearch_webcoll.py:
 database_tables_timestamps.append(get_table_update_time('rnkMETHODNAME'))
 }}}

 Therefore, in order to populate the datacacher table, statements like
 these are to be executed periodically in the background.
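
 One pass of such a gatherer might be sketched as follows (SQLite
 stand-in; the per-cacher definitions and the table/column names are
 invented for illustration):

 ```python
 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.executescript("""
 CREATE TABLE bibrec (id INTEGER, modification_date TEXT);
 CREATE TABLE bibfmt (id INTEGER, last_updated TEXT);
 CREATE TABLE datacacher (name TEXT PRIMARY KEY, timestamp TEXT);
 INSERT INTO bibrec VALUES (1, '2010-10-01 00:00:00');
 INSERT INTO bibfmt VALUES (1, '2010-10-02 00:00:00');
 INSERT INTO datacacher (name) VALUES ('collection_webpage');
 """)

 # Hypothetical definitions: which per-table queries feed one cacher's
 # timestamp; a real daemon would read these from datacacher.definition.
 DEFINITIONS = {
     "collection_webpage": [
         "SELECT MAX(modification_date) FROM bibrec",
         "SELECT MAX(last_updated) FROM bibfmt",
     ],
 }

 def gather_once(conn):
     """One daemon pass: take the newest timestamp across a cacher's
     tables and store it in datacacher.timestamp."""
     for name, queries in DEFINITIONS.items():
         stamps = [conn.execute(q).fetchone()[0] for q in queries]
         conn.execute("UPDATE datacacher SET timestamp=? WHERE name=?",
                      (max(s for s in stamps if s is not None), name))

 gather_once(conn)
 ```

 A real daemon would wrap `gather_once()` in a loop with a sleep
 interval; the sketch above runs a single pass only.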

 4) Usually the situation is not that convoluted: most data cachers
 consult only one or two tables, so for these technique number 3 would not
 be necessary; techniques 1 and 2 should suffice.  However, having a
 generic store for any future data cachers would be nice anyway.  It is
 part of this ticket to verify how many cases fall under techniques 1 and
 2 versus technique 3, in order to gauge the importance and spread of a
 generic technique like 3.

 5) Note that the datacacher table would be ultra small, holding mostly
 the list of registered data cachers and their DB timestamps, together
 with their definitions and anything needed for running the data cacher
 timestamp gatherer daemon, but not much else.  It would usually not hold
 any other runtime data; such data are usually stored elsewhere, say in
 the citation tables in the case of the citation daemon.

 (Just some (older) musings in writing before we re-discuss IRL.)

-- 
Ticket URL: <http://invenio-software.org/ticket/773>
Invenio <http://invenio-software.org>
