On 8/10/17 10:32 PM, Gregory Szorc wrote:
I've wanted to hack on overhauling revlogs and storage for several
months now. The more I think about how to go about it and the reasons
why I want to do it (mainly performance and scaling), the more I think a
more drastic departure from revlogs and the current storage "backend" is
needed. There are some great properties of revlogs and the
direct-addressable model of the current store, don't get me wrong. But
we can only scale that model so far (as I'm sure anyone from Facebook or
Google will tell you).

Anyway, overhauling storage is a daunting proposition. It should be
terrifying (because it is).

I started thinking about how we could facilitate extreme experimentation
on things like say swapping out a new storage backend. Perhaps even with
one implemented in Rust :)

Durham did a lot of work on the manifest code a few months back.
Essentially, he successfully decoupled the API of the manifest from the
low-level implementation of a revlog. It even abstracts away flat versus
tree manifests. It's great stuff. I was reminded of his work when I
started trying to do something similar for revlogs.

Long story short, one thing led to another and I had this crazy idea
that it would be a good idea to declare formal interfaces for important
constructs, like the changelog, manifests, file history, peers, and
quite possibly the repo object itself. If we did this and somehow
enforced the interface, it would be possible to swap out implementations
as needed. For example, if the classic store with 1 file/revlog per
tracked path doesn't scale for you because you have >1M files, then you
can swap in a store that uses "pack files," remote storage, etc. If a
classic filesystem-based working directory doesn't fit the bill, perhaps
you swap in one that is virtual filesystem aware.

I wanted to experiment with this idea with something that is easier to
reason about than storage. That led me to the "peer" classes (peer.py,
httppeer.py, sshpeer.py, etc). I've just submitted a series where I
formalize the peer interface using abstract base classes. Read the
commit messages for https://phab.mercurial-scm.org/D332
<https://urldefense.proofpoint.com/v2/url?u=https-3A__phab.mercurial-2Dscm.org_D332&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=nuarHzhP1wi1T9iURRCj1A&m=fikFAFyD9ByWfnQNkcDvHQA7rzOzSpj4nHfviqXvX5Y&s=jC6vqaHho7Al5tlhfi0BPKiUFMXa0-dccg0aVeTtpPs&e=>
and https://phab.mercurial-scm.org/D339
<https://urldefense.proofpoint.com/v2/url?u=https-3A__phab.mercurial-2Dscm.org_D339&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=nuarHzhP1wi1T9iURRCj1A&m=fikFAFyD9ByWfnQNkcDvHQA7rzOzSpj4nHfviqXvX5Y&s=sHB98Ga9MIwu4aVTNeAFZqa2TcOQG3MXwYAeApNjjiY&e=>
and the commits in that stack for more details.

Part of developing that series uncovered a number of minor bugs. And, I
think the end result is a peer API that is more easily understood and
easier to hack on. So, I think there is merit to the approach for code
maintainability reasons alone.

But I want to dream bigger and apply this to more significant primitives
(like storage).

If we adopt formal interfaces for important constructs, I'd like to see
support in the test harness for swapping in alternate implementations of
these things. Jun recently added #testcases syntax to .t tests so we
could run multiple variations of the same test. I'd like do something
similar at the entire test suite level. e.g. `run-tests.py
--changelog=sqlite` or `run-tests.py --store=leveldb`. Or more
realistically, `run-tests.py --peers http,ssh` would run the test suite
using both the http and ssh peer implementations. I could see that
culminating with tests naturally dividing themselves into low-level unit
tests (interface implementation specific) and higher-level, generic
integration tests. If we "code to the interface," it should be possible
to swap in a brand new implementation of something like a peer protocol
or changelog and it will pass the integration tests. We could even have
dummy, super fast implementations to facilitate hacking on things like
frontend features to help reduce the edit-test cycle.

Formal interfaces will facilitate extreme experimentation without the
traditional fragility that a dynamic language like Python introduces.
They will allow us to more clearly define boundaries between components.
This will make it vastly easier to refactor and do things like rewrite
large components in Rust. It would also make it *much* easier to
implement "hgit" (using the Mercurial CLI to interface with a Git
repository, advanced features like revlogs and all). (Call me crazy, but
I'd love to ship this feature as part of Mercurial to help entice new
users.)

Regardless of whether we go all in on formal interfaces, there's an
interesting idea 2 paragraphs back: tests that are implementation
agnostic. Today, we end up duplicating test functionality for minor
variations. e.g. http vs ssh vs local peer interactions. bundle1 vs
bundle2. I'd really like to move many tests to Jun's #testcases feature
because it will allow us to achieve higher test coverage while writing
fewer tests. I think it would be worthwhile to figure out how to
consolidate as many tests as possible so they can test behavior, not a
specific implementation. A side-benefit of doing this is we'll uncover
areas where implementations vary in behavior. This will help squash bugs
and produce a more consistent user interface and experience, because
every time we see e.g. different behavior between things performing the
same role, we'll ask ourselves why the discrepancy.

Anyway, this is probably the craziest Mercurial idea I've had in a
while. I'm not sure how much of it is realistic. But I'd certainly like
to establish some formality around interfaces for core components to
facilitate code comprehension, refactoring, and testing. The work I just
submitted on the peer API seems to show potential. I'm just not sure if
we can achieve some of the more ambitious goals with the approach I've
taken in that series. I'd love to hear what others think.

Gregory

I'm all for more formalized interfaces between layers. We've introduced a relatively formal interface for our remotefilelog/treemanifest storage and it has proven extremely helpful. For instance:

- It allows composing arbitrary stores, as long as they all meet the interface. We can union a series of stores so it can look in local caches first, then local storage, then memcache, then ssh remotely. - We can even insert a fake "store" that converts old formats to new formats. For instance, we have a store where when you ask for a manifest, it reads a flat manifest and turns it into a treemanifest and saves it. By putting that store at the end of the chain of stores we can look at normal tree manifest storage first, and transparently convert flat manifests to trees on demand as the user asks. - The same composability has let us implement bundlerepo more efficiently. Instead of layering the bundle revlogs on top of normal revlogs at the filesystem byte level, we just insert a bundle store at the beginning of the series of stores. - It has let us implement repack algorithms that can transparently operate on top of revlogs. So we coded the repack algorithm without knowledge of revlogs, plug in a revlog store, and repack magically outputs pack files. This could be used for many kinds of format changes.

It has been pretty straight forward and easy to do for manifests and filelogs, because they are mostly random access key->revision lookups. The changelog will be a bit harder because it has very different access patterns that are currently optimized for revlogs.

I think a basic interface could be put in place for those manifest and filelog storage without too much effort. The main place our current interface leaks is around 1) writes, and 2) changegroup creation. We don't have a generic interface for writes yet. Changegroups are very related to revlogs, and we haven't come up with a good abstraction that yet either.

This might be a good place to start using python type hints too.


Examples of our abstraction can be seen here:
https://bitbucket.org/facebook/hg-experimental/src/e61e2670306ee36eaece26188412ac82c97f2d1c/remotefilelog/contentstore.py

unioncontentstore shows a store that unions other stores
remotefilelogcontentstore implements the same interface but looks for data remotely manifestrevlogstore implements the same interface but wraps treemanifest relovgs

That's the data store abstraction (i.e. getting revision text). There's also an abstraction for history storage (node -> p1, p2, linknode, copyinfo)
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Reply via email to