On Mon, Sep 4, 2023 at 4:38 AM William Kenworthy <bi...@iinet.net.au> wrote:
>
> On 4/9/23 16:04, Nuno Silva wrote:
> >
> > (But note that Rich was suggesting using the *search* feature of the
> > gitweb interface, which, in this case, also finds the same topmost
> > commit if I search for "reedsolomon".)
> >
> tkx, missed that!

Note that in terms of indexing, git and CVS each have pros and cons,
because they use different data structures.  I've heard the saying
that Git is a data structure masquerading as an SCM, and certainly the
inconsistencies in the command line operations bear that out.  Git
tends to be much more useful in general, but for things like finding
deleted files CVS was definitely more time-efficient.

The reason for this is that everything in git is reachable via
commits, and these are reachable from a head via a linked list
(strictly a graph, since a merge commit has several parents).  The
most recent commit gives access to the current version of the
repository, and a pointer to the immediately previous commit(s).  To
find a deleted file, git must go to the most recent commit in whatever
branch you are searching, then descend its tree to look for the file.
If it is not found, it then goes to the previous commit and descends
that tree.  There are 745k commits in the active Gentoo repository.  I
think there are something like 2M of them in the historical one.  Each
commit is a random seek, and then each step down the directory tree to
find a file is another random seek.
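
As a rough illustration of that walk, here is a toy Python model.  It
is only a sketch: nested dicts stand in for git's hashed tree objects,
each commit has a single hypothetical "parent" field (so merges are
ignored), and the commit ids and paths are made up.  It just counts
how many objects would have to be read.

```python
# Toy model only: real git stores commits and trees as hashed objects.
# All commit ids and paths below are hypothetical.

def lookup(tree, parts):
    """Descend a nested-dict tree.  Return (found, tree_reads)."""
    reads = 0
    node = tree
    for part in parts:
        reads += 1  # one read for the tree object at this level
        if not isinstance(node, dict) or part not in node:
            return False, reads
        node = node[part]
    return True, reads


def find_last_commit_with(commits, head, path):
    """Walk back from head until a commit whose tree contains path."""
    parts = path.split("/")
    reads = 0
    cid = head
    while cid is not None:
        commit = commits[cid]
        reads += 1  # one read for the commit object itself
        found, tree_reads = lookup(commit["tree"], parts)
        reads += tree_reads
        if found:
            return cid, reads
        cid = commit["parent"]  # follow the chain to the previous commit
    return None, reads


commits = {
    "c1": {"parent": None,
           "tree": {"cat": {"pkg": {"pkg-1.ebuild": "x"}}}},
    "c2": {"parent": "c1",
           "tree": {"cat": {"pkg": {"pkg-1.ebuild": "x"},
                            "other": {"other-1.ebuild": "y"}}}},
    "c3": {"parent": "c2",  # pkg was deleted in this commit
           "tree": {"cat": {"other": {"other-1.ebuild": "y"}}}},
    "c4": {"parent": "c3",
           "tree": {"cat": {"other": {"other-2.ebuild": "z"}}}},
}

result = find_last_commit_with(commits, "c4", "cat/pkg/pkg-1.ebuild")
print(result)
```

The cost grows with the number of commits between the head and the
deletion, which is the whole problem when the file went away hundreds
of thousands of commits ago.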

In CVS everything is organized first by file, and then each file has
its own commit history.  So finding a file, deleted or otherwise, just
requires a seek for each level in the directory tree.  Then you can
directly read its history.
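
The same kind of toy model for the CVS layout, again just a sketch
with made-up paths and revision data: the repository is a directory
tree whose leaves are per-file histories (the ",v" files), so the
lookup cost depends only on the depth of the path, not on how many
commits there have been.

```python
# Toy model only: a nested dict stands in for the CVS repository
# directory tree, and each leaf is a hypothetical per-file history.

cvs_repo = {
    "cat": {
        "pkg": {
            "pkg-1.ebuild,v": [
                ("1.1", "initial import"),
                ("1.2", "version bump"),
                ("1.3", "removed"),
            ],
        },
    },
}


def file_history(repo, path):
    """Return (history, reads), with one read per path component."""
    reads = 0
    node = repo
    for part in path.split("/"):
        reads += 1  # one directory (or file) lookup per level
        node = node[part]
    return node, reads


history, reads = file_history(cvs_repo, "cat/pkg/pkg-1.ebuild,v")
print(reads, history[-1])
```

Deleted files are still there to be found this way, since CVS keeps
the history file around (in the Attic directory) rather than removing
it.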

So finding an old deleted file in the Gentoo git repo can require
millions of reads, while doing so in CVS only required about 3 (one
per level of the directory tree).  It is
no surprise that the web interfaces were designed to make that
operation much easier - if you do sufficiently complex searches in the
git web interface it will time you out to avoid bogging down the
server, which is why some searches may require you to clone the repo
and do it locally.

Now, if you want to find out what changed in a particular commit the
situation is reversed.  If you identify a commit in git and want to
see what changed, it can directly read the commit from disk using its
hash.  It then looks at the parent commit, then descends both trees
doing a diff at each level.  Since everything is content-hashed, only
directory trees that contain differences need to be read.  If a commit
had changes to 50 files, it might only take 10 reads to figure out
which files changed, and then another 100 to compare the contents of
each file and generate diffs.  If you wanted to do that in CVS you'd
have to read every single file in the repository and read the
sequential history of each file to find any commits that have the same
time/author.  CVS commits also aren't atomic, so ordering across files
might not be the same.
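
To show why content hashing makes the git side cheap, here is a small
Python sketch of a hashed tree diff.  It is not git's real object
format: the store() and changed_paths() helpers, the way the hash is
computed, and the paths are all assumptions made for the example.  The
point is only that a subtree whose hash is unchanged is skipped
without being read.

```python
import hashlib

# Toy content-addressed store; not git's actual object encoding.


def store(node, objects):
    """Store a nested dict (tree) or str (file); return its hash id."""
    if isinstance(node, dict):
        entries = {name: store(child, objects) for name, child in node.items()}
        payload = "tree:" + ";".join(
            f"{name}={oid}" for name, oid in sorted(entries.items()))
        obj = ("tree", entries)
    else:
        payload = "blob:" + node
        obj = ("blob", node)
    oid = hashlib.sha1(payload.encode()).hexdigest()
    objects[oid] = obj
    return oid


def kind(objects, oid):
    return objects[oid][0] if oid is not None else None


def changed_paths(objects, old_id, new_id, prefix=""):
    """Return (changed file paths, number of tree objects read)."""
    if old_id == new_id:
        return [], 0  # same hash: the whole subtree is identical
    paths, reads = [], 0
    kinds = (kind(objects, old_id), kind(objects, new_id))
    if "blob" in kinds:
        paths.append(prefix)  # file added, removed or modified
    if "tree" in kinds:
        old_entries = objects[old_id][1] if kinds[0] == "tree" else {}
        new_entries = objects[new_id][1] if kinds[1] == "tree" else {}
        reads += kinds.count("tree")
        for name in sorted(set(old_entries) | set(new_entries)):
            sub_paths, sub_reads = changed_paths(
                objects, old_entries.get(name), new_entries.get(name),
                f"{prefix}/{name}" if prefix else name)
            paths += sub_paths
            reads += sub_reads
    return paths, reads


objects = {}
old_root = store({
    "cat-a": {"pkg1": {"pkg1-1.ebuild": "a"},
              "pkg2": {"pkg2-1.ebuild": "b"}},
    "cat-b": {"pkg3": {"pkg3-1.ebuild": "c"}},
}, objects)
new_root = store({
    "cat-a": {"pkg1": {"pkg1-1.ebuild": "a"},
              "pkg2": {"pkg2-1.ebuild": "b2"}},  # the only change
    "cat-b": {"pkg3": {"pkg3-1.ebuild": "c"}},
}, objects)

paths, tree_reads = changed_paths(objects, old_root, new_root)
print(paths, tree_reads)
```

In this example cat-b and cat-a/pkg1 are never opened, because their
hashes match in both commits.  Only the trees along the path to the
modified file are read.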

Git is a thing of beauty when you think about what it was designed to
do and how well-suited to this design its architecture is.  The same
can be said of several data-driven FOSS applications.  The right
algorithm can make a huge difference...

-- 
Rich
