A design for improved & simplified caching

Edward K. Ream Wed, 10 Feb 2010 06:27:16 -0800

I'll implement the following for Leo 4.8, probably first in the
until-4-7-final branch.  There is no hurry to do this--the present
caching scheme is quite good as it is.  It should *not* be done for
Leo 4.7: the trunk will contain only bug fixes until Leo 4.7 final
goes out the door.


At the end of the "Improved caching for rc1?" thread I said:

QQQ
It's ironic that all the recent caching work has added essentially
nothing to Leo's caching capabilities. However, I am quite pleased
with the work, for several reasons
QQQ

I neglected the most important reason.  In the process of immersing
myself in the lowest-level details of the code, I primed my
subconscious to think expansively about the problem.  This is an
example of what I call "contraction followed by expansion" thinking.
It is only after getting stuck that one can get unstuck.

Sitting in the bath last night, I considered what the present scheme
does, and how it could be improved.  Here is a revision of the notes I
made after the bath.

Terminology: **top-level folders** are the direct subfolders of .leo/db.

Top-level folders represent file *locations*, not file contents.

The names of top-level folders have the form x_y, where x is the
short file name and y is a hashlib key corresponding to the full path
to the file.
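
A minimal sketch of this naming rule, not Leo's actual code.  The helper
name top_level_folder_name and the choice of md5 are assumptions made
for illustration; the design only says "a hashlib key".

```python
import hashlib
import os


def top_level_folder_name(full_path):
    """Return a folder name of the form x_y (hypothetical helper)."""
    # x: the short file name.
    short_name = os.path.basename(full_path)
    # y: a hashlib digest of the full path.  md5 is an assumption.
    path_key = hashlib.md5(full_path.encode("utf-8")).hexdigest()
    return f"{short_name}_{path_key}"
```

Two files with the same short name in different directories get
different folders, because y depends on the full path.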

Exception: the top-level "globals" folder represents g.app.db. This
contains minor data.

At present, top-level folders contain various subdirectories.  Details
don't matter, because we can dispense with them all.  This is the
substance of the new design.

In the new design, a top-level folder will contain only two files:

contents_<key>: the contents of the file.  Call this the **contents**
file.

data_<key>: a dict representing the "minor data" of the file:
<globals> element stuff, expansion bits, etc.  Call this the **data**
file.

Here <key> is the hashlib key (returned by cacher.fileKey) of the
entire contents of the file.
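
The two file names could be derived as sketched below.  Here file_key
is only a stand-in for cacher.fileKey, whose real implementation is not
shown in this post; md5 and the helper cache_file_names are assumptions.

```python
import hashlib


def file_key(file_contents):
    """Stand-in for cacher.fileKey: digest of the entire file contents."""
    # md5 is an assumption; any hashlib digest would serve the same role.
    return hashlib.md5(file_contents).hexdigest()


def cache_file_names(file_contents):
    """Return (contents file name, data file name) for these contents."""
    key = file_key(file_contents)
    return f"contents_{key}", f"data_{key}"
```

Both names share the same <key>, so the pair always refers to the same
version of the file.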

The top-level folder will contain cached data only for the latest
version of a file.  If Leo should somehow try to load an older version
of a cached file, the cacher class will reload the entire file, as it
should.  But this will seldom if ever happen.

For any top-level directory, and for any particular <key>, Leo will
only ever write the contents file once.  The proof is immediate: the
<key> depends on the entire contents of the file, so any change to the
file produces a different <key>.

Otoh, Leo (that is, the cacher) can write data_<key> as many times as
desired.  The Aha: this is perfectly safe.  The data in the data file
can never get out-of-sync with the contents of the contents file
because the <key> would change.
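
The write-once / write-many rule might look like this.  This is a
sketch under assumptions: write_cache is a hypothetical helper, md5
stands in for cacher.fileKey, and pickle is assumed as the on-disk
format for the data dict.

```python
import hashlib
import os
import pickle


def write_cache(folder, file_contents, minor_data):
    """Write both cache files for one version of a file (hypothetical).

    Returns True if the contents file was written by this call.
    """
    key = hashlib.md5(file_contents).hexdigest()  # assumed hash
    os.makedirs(folder, exist_ok=True)
    contents_path = os.path.join(folder, f"contents_{key}")
    data_path = os.path.join(folder, f"data_{key}")
    # The contents file is written at most once per key: an existing
    # file with this name necessarily holds the same contents.
    wrote_contents = not os.path.exists(contents_path)
    if wrote_contents:
        with open(contents_path, "wb") as f:
            f.write(file_contents)
    # The data file may be overwritten freely.  It shares the key, so
    # it cannot drift away from the contents it describes.
    with open(data_path, "wb") as f:
        pickle.dump(minor_data, f)
    return wrote_contents
```

Removing files left over from older keys is omitted here for brevity.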

Rather than writing "minor" data to a plethora of directories and
files, the cacher will write a single dict containing all minor data
to the data file.  It's as simple as that: the cacher class can easily
"queue" all data for writing.
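
One way the queueing could work, as a sketch only.  The class
MinorDataQueue, its method names, and the use of pickle are all
assumptions, not part of Leo.

```python
import pickle


class MinorDataQueue:
    """Hypothetical sketch: collect minor data in memory, write it once."""

    def __init__(self):
        # All minor data lives in a single dict until flushed.
        self.pending = {}

    def put(self, name, value):
        """Queue one item; nothing touches the disk yet."""
        self.pending[name] = value

    def flush(self, data_path):
        """Write the whole dict to the data file in a single operation."""
        with open(data_path, "wb") as f:
            pickle.dump(self.pending, f)
```

Callers would queue items such as expansion bits with put and call
flush once, with the path of the data_<key> file.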

That's it.

Imo, there are no down sides to the new scheme.  The up sides:

- It will be easier for humans to understand the contents of the cache
and to understand file modification dates.

- This scheme will simplify or even eliminate the complex path-
manipulation code in PickleShareDB.  The cacher will only ever create
top-level directories.

- At present, the clear-all-caches command is wimpy.  In the new
scheme it can *safely* clear all top-level directories.

- The cacher can safely use g.makeAllNonExistentDirectories to make
top-level directories.  This can be unit tested safely as well.
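
For instance, a clear-all-caches command could reduce to something like
the following sketch.  The function name and the db_root argument
(standing for the path to .leo/db) are assumptions.  Note that, as
written, it also removes the "globals" folder.

```python
import os
import shutil


def clear_all_caches(db_root):
    """Remove every top-level folder under db_root (hypothetical helper).

    Returns the sorted names of the folders that were removed.
    """
    removed = []
    for name in sorted(os.listdir(db_root)):
        path = os.path.join(db_root, name)
        # Only directories are cache folders; leave stray files alone.
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed
```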

Edward

