Greets,
As Lucy indexes are modified, they will move forward in discrete steps, each
of which will present a coherent point-in-time view of the index data.
Generically speaking, such point-in-time views of data are often called
"snapshots".
Since index files, once written, are never modified, a list of all the files
included in a "snapshot" is sufficient to describe it completely.
I propose that the master file which defines the snapshot be named the
"snapshot" file, that its primary purpose be to provide a list of files, that
it be encoded as human-readable JSON, and that we publish a public class named
Lucy::Index::Snapshot to control it.
In Lucene, the file which defines the snapshot is the binary "segments" file,
which contains a list of the active segments, along with some metadata
describing the characteristics of each segment. The files which make up the
snapshot are implied by the segment names, though the association isn't
perfect: for instance, the "doc store" files contain data which may be
referenced by more than one segment. This approach has several drawbacks.
Listing files is superior to listing segments, first because no kludges are
required to deal with extra-segment files like Lucene's doc stores, but also
because it allows pluggable index components greater flexibility. The only
way that the "segments" model can be extended to handle arbitrary files is to
add special case code to core classes. In contrast, the list-of-files model
allows individual components to manage their own data files, calling
Snapshot_Add_Entry() when new files are added during indexing, and
Snapshot_Delete_Entry() during merging when the plugin can determine that a
file is truly no longer needed.
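To make that division of labor concrete, here is a rough sketch in Python rather than in Lucy's C. The `Snapshot` class below is an illustrative stand-in for the proposed class, not its actual API, and `RecordsPlugin` with its two hook methods is entirely hypothetical; only the add-entry and delete-entry operations come from the proposal.

```python
class Snapshot:
    """Illustrative stand-in for the proposed class; not the Lucy API."""

    def __init__(self, entries=()):
        self._entries = set(entries)

    def add_entry(self, filename):
        """List filename in the next snapshot to be written."""
        self._entries.add(filename)

    def delete_entry(self, filename):
        """Unlist filename; the file itself stays in the index folder.

        Returns True if the entry was listed, False otherwise.
        """
        if filename in self._entries:
            self._entries.remove(filename)
            return True
        return False

    def entries(self):
        """Return the listed filenames in sorted order."""
        return sorted(self._entries)


class RecordsPlugin:
    """Hypothetical component that manages its own per-segment files."""

    FILES = ("records.dat", "records.ix")

    def __init__(self, snapshot):
        self.snapshot = snapshot

    def segment_written(self, seg_name):
        # Called during indexing, after this component writes its files.
        for name in self.FILES:
            self.snapshot.add_entry(f"{seg_name}/{name}")

    def segment_merged_away(self, seg_name):
        # Called during merging, once the data lives in another segment.
        for name in self.FILES:
            self.snapshot.delete_entry(f"{seg_name}/{name}")


snapshot = Snapshot()
plugin = RecordsPlugin(snapshot)
plugin.segment_written("seg_1")
plugin.segment_written("seg_2")
plugin.segment_merged_away("seg_1")
print(snapshot.entries())  # ['seg_2/records.dat', 'seg_2/records.ix']
```

Nothing in the stand-in `Snapshot` knows what a "records" file is; the component alone decides which files to list and when to unlist them.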
Snapshot_Delete_Entry() does not delete the file from the index folder; all it
does is remove the filename from the next snapshot to be written. Once the
new snapshot has been committed, it is possible to identify candidates for
deletion by determining which files are present in the old snapshot file but
gone from the new one.
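The comparison amounts to a set difference. A minimal sketch in Python, assuming both entry lists have already been read into memory; the function name `deletion_candidates` and the sample entries are made up for illustration:

```python
def deletion_candidates(old_entries, new_entries):
    """Return files listed in the old snapshot but not in the new one."""
    return sorted(set(old_entries) - set(new_entries))


# Hypothetical entry lists: seg_1 and seg_2 were merged into seg_3.
old_entries = [
    "schema_1.json",
    "seg_1",
    "seg_1/records.dat",
    "seg_2",
    "seg_2/records.dat",
]
new_entries = [
    "schema_1.json",
    "seg_3",
    "seg_3/records.dat",
]

for filename in deletion_candidates(old_entries, new_entries):
    print("candidate for deletion:", filename)
```

The result is only a list of candidates; deciding when it is actually safe to remove each file is a separate step that this sketch does not attempt.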
The Lucene "segments" file uses a custom binary format. The amount of data
stored by the "segments" file does not justify a binary encoding or the
maintenance burden of the one-off code needed to write and parse it, and a
binary file is more difficult to debug than a human-readable one. Here's how a
file written by the prototype Snapshot implementation in KS looks:
{
  "entries" : [
    "schema_1.json",
    "seg_3",
    "seg_3/lexicon-3.dat",
    "seg_3/lexicon-3.ix",
    "seg_3/lexicon-3.ixix",
    "seg_3/postings-3.dat",
    "seg_3/postings.skip",
    "seg_3/records.dat",
    "seg_3/records.ix",
    "seg_3/segmeta.json",
    "seg_3/term_vectors.dat",
    "seg_3/term_vectors.ix"
  ],
  "format" : "1"
}
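Because the format is plain JSON, a stock JSON library is enough to write and parse it. A minimal sketch in Python, assuming only the two keys shown above; the helper names and the "snapshot_1.json" filename are hypothetical, and the prototype's actual naming and validation may differ:

```python
import json
import os
import tempfile


def write_snapshot(path, entries, fmt="1"):
    """Write a snapshot file with the same two keys as the example above."""
    data = {"entries": sorted(entries), "format": fmt}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2, sort_keys=True)


def read_snapshot(path):
    """Return the entry list, rejecting files without the expected keys."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict) or "format" not in data:
        raise ValueError("not a snapshot file: missing 'format'")
    entries = data.get("entries")
    if not isinstance(entries, list):
        raise ValueError("not a snapshot file: 'entries' must be a list")
    return entries


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "snapshot_1.json")  # hypothetical filename
    write_snapshot(path, ["seg_3/segmeta.json", "schema_1.json", "seg_3"])
    loaded = read_snapshot(path)

print(loaded)
```

No format-specific reader or writer is involved here, which is the maintenance argument in miniature.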
The Snapshot class is simple enough that it might arguably belong within
Lucy::Store instead of Lucy::Index. However, the possibility that some
back-end engines might not adhere to the snapshot-style update policy, as
recently discussed in the "real time updates" thread, seems to indicate that
we may not want to embed snapshots too deeply in our virtual file system.
Prototype code:
http://tinyurl.com/proto-snapshot-bp
http://tinyurl.com/proto-snapshot-c
HTML presentation of public API documentation for Perl binding:
http://tinyurl.com/snapshot-dev-docs
Marvin Humphrey