Julian Foad <julianf...@apache.org> writes:

> Conclusions:
> ------------
>
> It is certainly possible that we could modify "update" and the other
> "online" operations, at least, and the previously "offline" operations
> too if we want, to make them fetch pristines at point-of-use in this way.
>
> Such modifications are not trivial. There is the need to run additional
> RA requests in between the existing ones, perhaps needing an additional
> RA session to be established in parallel, or taking care with inserting
> RA requests into an existing session.

I think that this part has a lot of protocol constraints and hidden complexity.
And things could probably get even more complex for merge and diff.

Consider a bulk update report over HTTP, which is just a single response that
has to be consumed in a streamy fashion.  There is no request multiplexing,
and fetching data through a separate connection is going to limit the maximum
size of a pristine file that can be downloaded without receiving a timeout on
the original connection. Assuming the default HTTP timeout of httpd 2.4.x
(60 seconds) and 100 MB/s data transfer rate, the limit for a pristine size
is going to be around 6 GB.
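To make that constraint concrete, here is a rough sketch of the arithmetic (the 60-second timeout and the 100 MB/s rate are the assumed figures from above, not measured values):

```python
# Estimate the largest pristine file that can be fetched over a separate
# connection before the original, idle connection hits the server timeout.
# Assumed values, matching the figures above:
timeout_seconds = 60        # httpd 2.4.x default Timeout directive
throughput_mb_per_s = 100   # assumed data transfer rate, MB/s

max_pristine_mb = timeout_seconds * throughput_mb_per_s
print(f"Max pristine size: {max_pristine_mb / 1000:.1f} GB")  # -> 6.0 GB
```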

This kind of problem probably isn't limited to this specific example and
protocol, considering that an update editor driver transfers control at
certain points (e.g., during a merge) and thus cannot keep reading the
response.

When I was working on the proof-of-concept, these issues led me to conclude
that fetching pristines at the point of access was not practically feasible.
That, in turn, resulted in the alternative approach initially implemented on
the `pristines-on-demand` branch.

Going slightly off-topic, I tend to think that despite its drawbacks, the
current approach on the branch should work reasonably well for the MVP.

To elaborate on that, let me first share a few assumptions and thoughts I had
in mind at that point in time:

1) Let's assume a high-throughput connection to the server, 1 Gbps or better.

   With slower networks, working with large files is going to be problematic
   by itself, so that might be thought of as being out of scope.

2) Let's assume that a working copy contains a large number of blob-like files.

   In other words, there are thousands of 10-100 MB files, as opposed to one
   or two 50 GB files.

3) Let's assume that in a common case, only a small fraction of these files
   are modified in the working copy.

Then for a working copy with 1,000 files of 100 MB each, of which 10 are
modified:

A) Every checkout saves 100 GB of disk space; that's pretty significant for
   a typical solid state drive.

B) Hydrating won't transfer more than 1 GB of data, or about 10 seconds
   under an optimistic assumption.

C) For a more uncommon case with 100 modified files, it's going to result in
   10 GB of data transferred and about two minutes of time; I think that's
   still pretty reasonable for an uncommon case.
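The figures in A)-C) can be reproduced with a small back-of-the-envelope
calculation (the file count, file size, and transfer rate below are the
assumed values from the points above, not measurements):

```python
# Back-of-the-envelope figures for a working copy with 1,000 files of
# 100 MB each, assuming roughly 100 MB/s of effective throughput
# (about 1 Gbps), as in assumption 1).
file_size_mb = 100
total_files = 1000
throughput_mb_per_s = 100

# A) Disk space saved per checkout by not storing pristines locally.
saved_gb = total_files * file_size_mb / 1000
print(f"Disk space saved per checkout: {saved_gb:.0f} GB")  # -> 100 GB

# B) and C) Data transferred and time spent hydrating, for the common
# case (10 modified files) and the uncommon case (100 modified files).
for modified in (10, 100):
    transfer_mb = modified * file_size_mb
    seconds = transfer_mb / throughput_mb_per_s
    print(f"{modified} modified files: {transfer_mb / 1000:.0f} GB, "
          f"~{seconds:.0f} s to hydrate")
```

For 100 modified files this gives about 100 seconds of transfer time, which is in the same ballpark as the "about two minutes" above once protocol overhead is accounted for.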

So while the approach used in the proof-of-concept might be non-ideal, I tend
to think it should work reasonably well in a variety of use cases.

I also think that this approach should even be releasable in the form of an
MVP, accompanied by a UI option to select between the two states during
checkout (all pristines / pristines-on-demand) that is persisted in the
working copy.

A small final note is that I could be missing some details or other cases,
but in the meantime I felt like sharing the thoughts I had while working on
the proof-of-concept.


Thanks,
Evgeny Kotkov
