On Mon, Oct 21, 2019 at 12:42 PM Richard Yao <r...@gentoo.org> wrote:
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just 
> have the mirrors do the hashing behind the scenes. Then we would have the 
> best of both worlds.

I think something that is getting missed in this discussion is that we
don't control all of our mirrors, and they're generally donated
resources.  Somebody has some webserver, and they stick a Debian
mirror in one directory tree, and an Arch one in another, and they're
kind enough to give us one too.

That is why we're seeing odder situations like NTFS and so on being
mentioned.  They're not necessarily even running Linux, let alone ZFS
or some other optimized filesystem.  And their webserver might be set
up to do browsable directory indexes which could perform terribly even
if the filesystem itself is fine with direct filename lookups.  It
doesn't matter if you have hashed b-trees or whatever for filename
lookups if you're going to ask the filesystem to give you a list of
every file in a large directory - it is going to have to traverse
whatever data structure it uses entirely to do so.
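
As a rough illustration of that difference (a sketch, not a benchmark;
the helper names and the pkg-N.tar.gz filenames are made up for the
example): fetching one known filename is a single lookup, while a
browsable index has to visit every entry in the directory.

```python
import os
import tempfile


def lookup(directory, name):
    """Direct filename lookup: one path resolution, no enumeration."""
    return os.path.exists(os.path.join(directory, name))


def build_index(directory):
    """Browsable index: has to visit every entry in the directory."""
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def demo(file_count=1000):
    """Create a throwaway directory and exercise both access patterns."""
    with tempfile.TemporaryDirectory() as directory:
        for i in range(file_count):
            # Hypothetical distfile names, purely for illustration.
            open(os.path.join(directory, f"pkg-{i}.tar.gz"), "w").close()
        found = lookup(directory, "pkg-0.tar.gz")
        listed = len(build_index(directory))
    return found, listed
```

The cost of build_index() grows with the number of files no matter how
clever the filesystem's lookup structure is, which is the case a
directory-index page on a mirror runs into.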

If we want to start putting requirements on hosting a mirror, then
we'll end up with fewer mirrors, and with mirrors more is usually
better.  Ideally a mirror should just be a black box to us - we don't
really care what they're running because we don't depend on any mirror
individually.  Likewise, if we negatively impact mirror hosts we'll end
up with fewer mirrors.  Sure, maybe those hosts have odd
configurations, but we're still better off with them than without.
That said, we do seem to have a lot of mirrors, so it probably isn't the
end of the world if we lose a limited number.

And there is nothing to say that we can't have some infra mirror set
up more for interactive browsing that we don't have people fetch from
but which dispenses with all the hashing or which bins by the first
letter of the filename/etc.  It seems like most of the use cases where
hashing is inconvenient are for more casual use.
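
To make the two layouts concrete, here is a minimal sketch.  The
function names are hypothetical, and the choice of BLAKE2b with a
two-hex-digit prefix is an assumption for illustration, not something
settled in this thread.

```python
import hashlib


def first_letter_bin(filename):
    """Browsable layout: bin by the first character of the filename."""
    return f"{filename[0].lower()}/{filename}"


def hashed_bin(filename, hex_digits=2):
    """Hashed layout: bin by a prefix of a hash of the filename.

    BLAKE2b and the prefix length are assumptions for this sketch.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return f"{digest[:hex_digits]}/{filename}"
```

A person can guess where first_letter_bin() puts a file just by looking
at its name, which is what makes it pleasant for browsing; hashed_bin()
spreads files more evenly but needs a tool to compute the path.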

To avoid sending another reply: people are talking about having
utilities that can fetch distfiles using the new scheme.  I'd think
that "ebuild foo.ebuild fetch" is probably the simplest solution for
this.  Chances are that you're dealing with SRC_URI strings that have
variable substitution in them anyway, so just letting ebuild do the
fetching means you're not substituting ${PV} and so on by hand, let
alone redoing all the stuff versionator and its ilk do.  And of course
you can always just fetch from upstream anyway if you do have a clean
URI.
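
For a sense of what doing that substitution by hand looks like, here is
a toy sketch.  The expand_src_uri() helper, the example.org URL, and
the variable values are all made up; it only handles plain ${VAR}
references and none of the bash-style manipulation a real ebuild can
do, which is exactly why letting ebuild do the work is simpler.

```python
from string import Template


def expand_src_uri(src_uri, variables):
    """Expand plain ${VAR} references in a SRC_URI-like string.

    Only simple ${NAME} placeholders are supported; anything an ebuild
    computes in bash is out of scope for this sketch.
    """
    return Template(src_uri).substitute(variables)


# Hypothetical package name, version, and upstream URL.
uri = expand_src_uri(
    "https://example.org/dist/${PN}-${PV}.tar.gz",
    {"PN": "foo", "PV": "1.2.3"},
)
```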

