I've wanted to hack on overhauling revlogs and storage for several months
now. The more I think about how to go about it and the reasons why I want
to do it (mainly performance and scaling), the more I think a more drastic
departure from revlogs and the current storage "backend" is needed. There
are some great properties of revlogs and the direct-addressable model of
the current store, don't get me wrong. But we can only scale that model so
far (as I'm sure anyone from Facebook or Google will tell you).

Anyway, overhauling storage is a daunting proposition. It should be
terrifying (because it is).

I started thinking about how we could facilitate extreme experimentation on
things like, say, swapping in a new storage backend. Perhaps even one
implemented in Rust :)

Durham did a lot of work on the manifest code a few months back.
Essentially, he successfully decoupled the API of the manifest from the
low-level implementation of a revlog. It even abstracts away flat versus
tree manifests. It's great stuff. I was reminded of his work when I started
trying to do something similar for revlogs.

Long story short, one thing led to another and I had this crazy idea that
it would be a good idea to declare formal interfaces for important
constructs, like the changelog, manifests, file history, peers, and quite
possibly the repo object itself. If we did this and somehow enforced the
interface, it would be possible to swap out implementations as needed. For
example, if the classic store with 1 file/revlog per tracked path doesn't
scale for you because you have >1M files, then you can swap in a store that
uses "pack files," remote storage, etc. If a classic filesystem-based
working directory doesn't fit the bill, perhaps you swap in one that is
virtual filesystem aware.
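
To make the "swap out implementations" idea concrete, here is a minimal sketch in Python. Every name in it (`ifilestorage`, `perfilestorage`, `packstorage`, `roundtrip`) is invented for illustration and is not an existing Mercurial API; the two classes are toy in-memory stand-ins for the "1 file per tracked path" and "pack file" models, not real storage code.

```python
import abc


class ifilestorage(abc.ABC):
    """Hypothetical interface for storing revisions of tracked files."""

    @abc.abstractmethod
    def add(self, path, data):
        """Store a new revision of ``path`` and return its revision number."""

    @abc.abstractmethod
    def read(self, path, rev):
        """Return the data stored for revision ``rev`` of ``path``."""


class perfilestorage(ifilestorage):
    """Toy stand-in for the classic model: one container per tracked path."""

    def __init__(self):
        self._files = {}  # path -> list of revision data

    def add(self, path, data):
        revs = self._files.setdefault(path, [])
        revs.append(data)
        return len(revs) - 1

    def read(self, path, rev):
        return self._files[path][rev]


class packstorage(ifilestorage):
    """Toy stand-in for a "pack file" model: one container for all paths."""

    def __init__(self):
        self._pack = []    # revision data for every path, in insertion order
        self._index = {}   # (path, rev) -> offset into self._pack
        self._counts = {}  # path -> number of revisions stored so far

    def add(self, path, data):
        rev = self._counts.get(path, 0)
        self._counts[path] = rev + 1
        self._index[(path, rev)] = len(self._pack)
        self._pack.append(data)
        return rev

    def read(self, path, rev):
        return self._pack[self._index[(path, rev)]]


def roundtrip(storage, path, data):
    """Caller that relies only on the interface, not on an implementation."""
    rev = storage.add(path, data)
    return storage.read(path, rev)
```

Because `roundtrip()` only touches `add()` and `read()`, either class can be passed in without the caller changing.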

I wanted to experiment with this idea with something that is easier to
reason about than storage. That led me to the "peer" classes (peer.py,
httppeer.py, sshpeer.py, etc). I've just submitted a series where I
formalize the peer interface using abstract base classes. Read the commit
messages for https://phab.mercurial-scm.org/D332 and
https://phab.mercurial-scm.org/D339 and the commits in that stack for more.
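
For readers unfamiliar with abstract base classes, the following sketch shows the kind of enforcement they give. It is not the code from the series above: `basepeer`, its two methods, and the example classes are hypothetical and chosen only to illustrate the mechanism.

```python
import abc


class basepeer(abc.ABC):
    """Hypothetical peer interface (not the one defined in the series)."""

    @abc.abstractmethod
    def url(self):
        """Return the URL this peer talks to."""

    @abc.abstractmethod
    def capabilities(self):
        """Return the set of capabilities the peer advertises."""


class incompletepeer(basepeer):
    """Forgets to implement capabilities()."""

    def url(self):
        return 'http://example.invalid/repo'


class completepeer(incompletepeer):
    """Fills in the missing method, so it satisfies the interface."""

    def capabilities(self):
        return {'lookup'}


def caninstantiate(cls):
    """Report whether ``cls`` can be instantiated with no arguments.

    The abc machinery raises TypeError when a class still has
    unimplemented abstract methods.
    """
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here `caninstantiate(incompletepeer)` is false while `caninstantiate(completepeer)` is true, so a missing method is caught when the object is created rather than when the method is first called.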

Developing that series uncovered a number of minor bugs. And I think the
end result is a peer API that is more easily understood and easier to hack
on. So I think there is merit to the approach for code maintainability
reasons alone.

But I want to dream bigger and apply this to more significant primitives
(like storage).

If we adopt formal interfaces for important constructs, I'd like to see
support in the test harness for swapping in alternate implementations of
these things. Jun recently added #testcases syntax to .t tests so we could
run multiple variations of the same test. I'd like to do something similar at
the entire test suite level. e.g. `run-tests.py --changelog=sqlite` or
`run-tests.py --store=leveldb`. Or more realistically, `run-tests.py
--peers http,ssh` would run the test suite using both the http and ssh peer
implementations. I could see that culminating with tests naturally dividing
themselves into low-level unit tests (interface implementation specific)
and higher-level, generic integration tests. If we "code to the interface,"
it should be possible to swap in a brand new implementation of something
like a peer protocol or changelog and it will pass the integration tests.
We could even have dummy, super fast implementations to shorten the
edit-test cycle when hacking on things like frontend features.
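
As a rough illustration of one behavioral test running against several implementations, here is a sketch using Python's `unittest` rather than .t tests. The changelog classes, the `IMPLEMENTATIONS` registry, and `runsuite()` are all hypothetical; nothing here reflects how run-tests.py works or would work.

```python
import io
import unittest


class listchangelog:
    """Hypothetical changelog stand-in backed by a list."""

    def __init__(self):
        self._entries = []

    def add(self, message):
        self._entries.append(message)
        return len(self._entries) - 1

    def message(self, rev):
        return self._entries[rev]

    def __len__(self):
        return len(self._entries)


class dictchangelog:
    """Hypothetical changelog stand-in backed by a dict keyed by revision."""

    def __init__(self):
        self._entries = {}

    def add(self, message):
        rev = len(self._entries)
        self._entries[rev] = message
        return rev

    def message(self, rev):
        return self._entries[rev]

    def __len__(self):
        return len(self._entries)


# Registry a harness option could select from (names are made up).
IMPLEMENTATIONS = {'list': listchangelog, 'dict': dictchangelog}


class changelogbehaviortests(unittest.TestCase):
    """Tests behavior only; knows nothing about a specific implementation."""

    def testaddthenread(self):
        for name, factory in sorted(IMPLEMENTATIONS.items()):
            with self.subTest(implementation=name):
                cl = factory()
                rev = cl.add('initial commit')
                self.assertEqual(rev, 0)
                self.assertEqual(cl.message(rev), 'initial commit')
                self.assertEqual(len(cl), 1)


def runsuite():
    """Run the behavioral tests quietly and return the TestResult."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(changelogbehaviortests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Adding a third implementation to `IMPLEMENTATIONS` would exercise it with the same assertions, and any difference in behavior between implementations would show up as a failing subtest.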

Formal interfaces will facilitate extreme experimentation without the
traditional fragility that a dynamic language like Python introduces. They
will allow us to more clearly define boundaries between components. This
will make it vastly easier to refactor and do things like rewrite large
components in Rust. It would also make it *much* easier to implement "hgit"
(using the Mercurial CLI to interface with a Git repository, advanced
features like revsets and all). (Call me crazy, but I'd love to ship this
feature as part of Mercurial to help entice new users.)

Regardless of whether we go all in on formal interfaces, there's an
interesting idea 2 paragraphs back: tests that are implementation agnostic.
Today, we end up duplicating test functionality for minor variations. e.g.
http vs ssh vs local peer interactions. bundle1 vs bundle2. I'd really like
to move many tests to Jun's #testcases feature because it will allow us to
achieve higher test coverage while writing fewer tests. I think it would be
worthwhile to figure out how to consolidate as many tests as possible so
they can test behavior, not a specific implementation. A side-benefit of
doing this is we'll uncover areas where implementations vary in behavior.
This will help squash bugs and produce a more consistent user interface and
experience, because every time we see e.g. different behavior between
things performing the same role, we'll ask ourselves why the discrepancy.

Anyway, this is probably the craziest Mercurial idea I've had in a while.
I'm not sure how much of it is realistic. But I'd certainly like to
establish some formality around interfaces for core components to
facilitate code comprehension, refactoring, and testing. The work I just
submitted on the peer API seems to show potential. I'm just not sure if we
can achieve some of the more ambitious goals with the approach I've taken
in that series. I'd love to hear what others think.

Mercurial-devel mailing list
