Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Kent Fredric
On Wed, 23 Oct 2019 01:16:51 -0400
Joshua Kinard  wrote:

> And for Perl or Python, I think we should be making an effort to leverage
> their respective mirroring systems first before putting their distfiles onto
> our mirrors.  Perl's got CPAN, and Python has pypi.  For things that don't
> exist on those systems, then we use our mirrors.

We still have to mirror them, because upstream has a tendency to nuke
things so that they can't be fetched any more from these primary
sources.

So whether end users fetch from the distfiles mirror as the first hit
or as a fallback, the cost is still there.

The packages aren't broken and upstream hasn't stopped shipping them;
it's just that some upstreams have a fetish for nuking everything but
the latest-and-greatest, at a pace that is absolutely ridiculous and
that we can't possibly keep up with, given all the stabilization
rigmarole.

Yes, backpan does exist, but it's neither perfect nor fast.

And the faster upstream nukes things, the more likely it is it won't
even be mirrored on backpan!

(I wish I were imagining this circumstance, but it's happened far too
often.)

And we're not doing our users any service by burdening them with this
madness.




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Fabian Groffen
On 29-10-2019 15:45:34 +0100, Michał Górny wrote:
> On Tue, 2019-10-29 at 15:33 +0100, Fabian Groffen wrote:
> > In addition, there are currently files there that aren't referenced from
> > ebuilds.  Prefix uses these files during bootstrap, local mirrors are
> > often much faster than dev.g.o.
> > 
> > If the files don't get mirrored anymore, I guess I can create a dummy
> > ebuild that has the files in SRC_URI.
> 
> Ok, this is something I wasn't aware of.  I agree that dummy ebuild
> should not be necessary here.  However, I'm also not sure if distfiles-
> local is really the proper way either, especially that I don't see such
> files on woodpecker right now.

There should be /space/distfiles-local and
/space/distfiles-whitelist/prefix with a list of files to retain on the
mirror.

Thanks,
Fabian

> I don't think the matter is urgent right now, so let's ponder on it
> a bit.  In particular, I think we should have a clear indication of who
> added which files, when, what for and where they came from.  Those are
> precisely the things that the current distfiles-local approach misses.
> 
> > If the files get mirrored, but put in a subdir based on the filename
> > hash, the original query endpoint on distfiles.g.o changes, much like
> > the SRC_URI approach.
> > 
> > Now I can use distfiles.prefix.b.n which redirects to the distfiles.g.o
> > URL with subdir for most part I think, but it's sub-optimal from my
> > point of view.  Calculating the hash is not always feasible due to the
> > lack of b2sum or other means.  Hence my earlier request to have such
> > official translation service on Gentoo hardware.
> > 
> > (I just wrote a small wsgi script that calculates the hash and generates
> > the redirect from Python, served via uwsgi/nginx, but there should be
> > many ways to achieve the same goals, if and only if a blake2b
> > implementation were available for it.)
> 
> This is also something that needs thinking.  I personally don't mind
> having one but it would be nice if it was able to account for geodns
> and such.
> 
> -- 
> Best regards,
> Michał Górny
> 



-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Michał Górny
On Tue, 2019-10-29 at 15:33 +0100, Fabian Groffen wrote:
> On 29-10-2019 15:17:38 +0100, Ulrich Mueller wrote:
> > > > > > > On Tue, 29 Oct 2019, Michał Górny wrote:
> > > On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> > > > > What if the file is hosted at a non-standard tcp port upstream
> > > > > (like http://example.org:8080/)? The devmanual says that it _must_
> > > > > be manually uploaded to /space/distfiles-local/ in such cases.
> > > > Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> > > > or nasty?) upstream blocks wget for some reason, but other methods
> > > > (e.g., curl, firefox) work? How would I get the file onto the mirrors
> > > > there?
> > > If I were you, I would've explicitly mirrored the file anyway.
> > > If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> > > also suffer due to it.
> > 
> > All what I'm saying is that there can be unusual circumstances where
> > manual uploading of a file is useful. So please don't take that
> > possibility away.
> 
> In addition, there are currently files there that aren't referenced from
> ebuilds.  Prefix uses these files during bootstrap, local mirrors are
> often much faster than dev.g.o.
> 
> If the files don't get mirrored anymore, I guess I can create a dummy
> ebuild that has the files in SRC_URI.

Ok, this is something I wasn't aware of.  I agree that a dummy ebuild
should not be necessary here.  However, I'm also not sure if
distfiles-local is really the proper way either, especially since I
don't see such files on woodpecker right now.

I don't think the matter is urgent right now, so let's ponder on it
a bit.  In particular, I think we should have a clear indication of who
added which files, when, what for and where they came from.  Those are
precisely the things that the current distfiles-local approach misses.

> If the files get mirrored, but put in a subdir based on the filename
> hash, the original query endpoint on distfiles.g.o changes, much like
> the SRC_URI approach.
> 
> Now I can use distfiles.prefix.b.n which redirects to the distfiles.g.o
> URL with subdir for most part I think, but it's sub-optimal from my
> point of view.  Calculating the hash is not always feasible due to the
> lack of b2sum or other means.  Hence my earlier request to have such
> official translation service on Gentoo hardware.
> 
> (I just wrote a small wsgi script that calculates the hash and generates
> the redirect from Python, served via uwsgi/nginx, but there should be
> many ways to achieve the same goals, if and only if a blake2b
> implementation were available for it.)
> 

This is also something that needs some thought.  I personally don't mind
having one, but it would be nice if it were able to account for geodns
and such.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Fabian Groffen
On 29-10-2019 15:17:38 +0100, Ulrich Mueller wrote:
> > On Tue, 29 Oct 2019, Michał Górny wrote:
> 
> > On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> >> > What if the file is hosted at a non-standard tcp port upstream
> >> > (like http://example.org:8080/)? The devmanual says that it _must_
> >> > be manually uploaded to /space/distfiles-local/ in such cases.
> 
> >> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> >> or nasty?) upstream blocks wget for some reason, but other methods
> >> (e.g., curl, firefox) work? How would I get the file onto the mirrors
> >> there?
> 
> > If I were you, I would've explicitly mirrored the file anyway.
> > If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> > also suffer due to it.
> 
> All what I'm saying is that there can be unusual circumstances where
> manual uploading of a file is useful. So please don't take that
> possibility away.

In addition, there are currently files there that aren't referenced from
ebuilds.  Prefix uses these files during bootstrap, since local mirrors
are often much faster than dev.g.o.

If the files don't get mirrored anymore, I guess I can create a dummy
ebuild that has the files in SRC_URI.

If the files get mirrored, but put in a subdir based on the filename
hash, the original query endpoint on distfiles.g.o changes, much like
the SRC_URI approach.

Now I can use distfiles.prefix.b.n, which redirects to the distfiles.g.o
URL with subdir for the most part, I think, but it's sub-optimal from my
point of view.  Calculating the hash is not always feasible due to the
lack of b2sum or other means.  Hence my earlier request to have such an
official translation service on Gentoo hardware.

(I just wrote a small wsgi script that calculates the hash and generates
the redirect from Python, served via uwsgi/nginx, but there should be
many ways to achieve the same goals, if and only if a blake2b
implementation were available for it.)
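
For illustration only, a minimal sketch of what such a redirect service
could look like.  It is not the actual script.  It assumes the subdir is
the first two hex digits of the BLAKE2b digest of the file name, and
BASE_URL is a placeholder for wherever the hashed layout really lives.

```python
import hashlib
from urllib.parse import quote

# Assumption: placeholder base URL of the mirror that uses the hashed layout.
BASE_URL = "https://distfiles.example.org/distfiles"


def application(environ, start_response):
    """WSGI app: redirect /<filename> to BASE_URL/<bucket>/<filename>."""
    # WSGI hands over PATH_INFO as a latin-1 decoded str; recover raw bytes.
    raw_path = environ.get("PATH_INFO", "").encode("latin-1")
    filename = raw_path.rsplit(b"/", 1)[-1]
    if not filename:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"no file name given\n"]

    # Assumed layout: bucket = first 2 hex digits of BLAKE2b(file name).
    bucket = hashlib.blake2b(filename).hexdigest()[:2]
    location = f"{BASE_URL}/{bucket}/{quote(filename)}"
    start_response("302 Found", [("Location", location)])
    return [b""]
```

Any WSGI server (uwsgi behind nginx, for instance) can serve the
`application` callable; the only requirement is a Python whose hashlib
provides blake2b.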

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Ulrich Mueller
> On Tue, 29 Oct 2019, Michał Górny wrote:

> On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
>> > What if the file is hosted at a non-standard tcp port upstream
>> > (like http://example.org:8080/)? The devmanual says that it _must_
>> > be manually uploaded to /space/distfiles-local/ in such cases.

>> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
>> or nasty?) upstream blocks wget for some reason, but other methods
>> (e.g., curl, firefox) work? How would I get the file onto the mirrors
>> there?

> If I were you, I would've explicitly mirrored the file anyway.
> If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> also suffer due to it.

All I'm saying is that there can be unusual circumstances where
manual uploading of a file is useful. So please don't take that
possibility away.

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Michał Górny
On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 29 Oct 2019, Ulrich Mueller wrote:
> > > > > > On Tue, 29 Oct 2019, Michał Górny wrote:
> > > The file should be placed in SRC_URI, and emirrordist will take care
> > > of fetching it.
> > What if the file is hosted at a non-standard tcp port upstream (like
> > http://example.org:8080/)? The devmanual says that it _must_ be manually
> > uploaded to /space/distfiles-local/ in such cases.
> 
> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> or nasty?) upstream blocks wget for some reason, but other methods
> (e.g., curl, firefox) work? How would I get the file onto the mirrors
> there?
> 

If I were you, I would've explicitly mirrored the file anyway.
If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
also suffer due to it.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Michał Górny
On Tue, 2019-10-29 at 14:03 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 29 Oct 2019, Michał Górny wrote:
> > On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
> > > So, what has to be done to have it appear in the proper place?
> > > Should the file be placed in a subdir of /space/distfiles-local/?
> > > That seems to be error prone, and certainly could be automated?
> > The file should be placed in SRC_URI, and emirrordist will take care
> > of fetching it.
> 
> What if the file is hosted at a non-standard tcp port upstream (like
> http://example.org:8080/)? The devmanual says that it _must_ be manually
> uploaded to /space/distfiles-local/ in such cases.
> 

I can't really see why this wouldn't work.  I just ran an experiment
using app-benchmarks/forkbomb, and emirrordist fetched it just fine.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Ulrich Mueller
> On Tue, 29 Oct 2019, Ulrich Mueller wrote:

> On Tue, 29 Oct 2019, Michał Górny wrote:
>> The file should be placed in SRC_URI, and emirrordist will take care
>> of fetching it.

> What if the file is hosted at a non-standard tcp port upstream (like
> http://example.org:8080/)? The devmanual says that it _must_ be manually
> uploaded to /space/distfiles-local/ in such cases.

Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
or nasty?) upstream blocks wget for some reason, but other methods
(e.g., curl, firefox) work? How would I get the file onto the mirrors
there?

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Ulrich Mueller
> On Tue, 29 Oct 2019, Michał Górny wrote:

> On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
>> So, what has to be done to have it appear in the proper place?
>> Should the file be placed in a subdir of /space/distfiles-local/?
>> That seems to be error prone, and certainly could be automated?

> The file should be placed in SRC_URI, and emirrordist will take care
> of fetching it.

What if the file is hosted at a non-standard tcp port upstream (like
http://example.org:8080/)? The devmanual says that it _must_ be manually
uploaded to /space/distfiles-local/ in such cases.

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Michał Górny
On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 29 Oct 2019, Michał Górny wrote:
> > On October 29, 2019 9:34:01 AM UTC, Fabian Groffen wrote:
> > > /space/distfiles-local is no longer copied to the mirrors? or just
> > > not copied in the subdir-hierarchy?
> > The latter.
> 
> So, what has to be done to have it appear in the proper place? Should
> the file be placed in a subdir of /space/distfiles-local/? That seems to
> be error prone, and certainly could be automated?

The file should be placed in SRC_URI, and emirrordist will take care of
fetching it.

> 
> > > Just wondering. Do you mean it isn't valid that some upstreams do
> > > this (yes horror)? We surely need a way to work around that ...
> > I mean the method using same filename and expecting distfiles-local to
> > overwrite it. It is preferable to just rename it.
> 
> Looks like this will break backwards compatibility. IIUC, backwards
> compatibility is also broken on the receiving side, that is,
> mirror://gentoo/ in SRC_URI will no longer work as expected?

Yes, this was noted in the top mail.

> Shouldn't GLEP 75 have mentioned this? It's certainly something that
> needs to be discussed before the GLEP is implemented.

The GLEP only covers how regular distfile fetching works.  Third-party
mirrors are out of scope, and everyone working on it and reviewing it
missed the problem.  That said, this can't be fixed within the bounds
defined by PMS.

Given that mirror://gentoo has been discouraged since at least 2011, I
don't see a big deal here.  One day it'll stop working; we should stop
using it before then.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Ulrich Mueller
> On Tue, 29 Oct 2019, Michał Górny wrote:

> On October 29, 2019 9:34:01 AM UTC, Fabian Groffen wrote:
>> /space/distfiles-local is no longer copied to the mirrors? or just
>> not copied in the subdir-hierarchy?

> The latter.

So, what has to be done to have it appear in the proper place? Should
the file be placed in a subdir of /space/distfiles-local/? That seems to
be error prone, and certainly could be automated?

>> Just wondering. Do you mean it isn't valid that some upstreams do
>> this (yes horror)? We surely need a way to work around that ...

> I mean the method using same filename and expecting distfiles-local to
> overwrite it. It is preferable to just rename it.

Looks like this will break backwards compatibility. IIUC, backwards
compatibility is also broken on the receiving side, that is,
mirror://gentoo/ in SRC_URI will no longer work as expected?

Shouldn't GLEP 75 have mentioned this? It's certainly something that
needs to be discussed before the GLEP is implemented.

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Michał Górny
On October 29, 2019 9:34:01 AM UTC, Fabian Groffen wrote:
>On 29-10-2019 05:27:37 +0100, Michał Górny wrote:
>> On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
>> > Hi!
>> > 
>> > > Today you get chastised for using /space/distfiles-local and not
>> > > following policy changes.  The devmanual states that it's deprecated
>> > > since at least 2011, and talks of using d.g.o [1].
>> > > [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
>> > 
>> > Sorry I'm late to the party, but I would like to enquire about what
>> > happens if a file with existing filename but different b2sum gets
>> > uploaded to /space/distfiles-local now?
>> 
>> The same as before.  It gets put in top-level distfiles directory.
>> Hashes are calculated from filenames, so this wouldn't affect it.  That
>> is, if it put those files in subdirectories in the first place because
>> it doesn't.
>
>/space/distfiles-local is no longer copied to the mirrors? or just not
>copied in the subdir-hierarchy?

The latter.

>
>> > Doing so and updating the Manifest used to be another (not necessarily
>> > preferred) method to address upstream remaking release packages.
>> > 
>> 
>> It's no longer valid.
>
>Just wondering.  Do you mean it isn't valid that some upstreams do this
>(yes horror)?  We surely need a way to work around that ...

I mean the method of using the same filename and expecting
distfiles-local to overwrite it.  It is preferable to just rename the
file.

>
>Thanks,
>Fabian


--
Best regards, 
Michał Górny



Re: [gentoo-dev] New distfile mirror layout

2019-10-29 Thread Fabian Groffen
On 29-10-2019 05:27:37 +0100, Michał Górny wrote:
> On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
> > Hi!
> > 
> > > Today you get chastised for using /space/distfiles-local and not
> > > following policy changes.  The devmanual states that it's deprecated
> > > since at least 2011, and talks of using d.g.o [1].
> >  > [1] 
> > https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
> > 
> > Sorry I'm late to the party, but I would like to enquire about what happens 
> > if a file with existing filename but different b2sum gets uploaded to 
> > /space/distfiles-local now?
> 
> The same as before.  It gets put in top-level distfiles directory. 
> Hashes are calculated from filenames, so this wouldn't affect it.  That
> is, if it put those files in subdirectories in the first place because
> it doesn't.

/space/distfiles-local is no longer copied to the mirrors? or just not
copied in the subdir-hierarchy?

> > Doing so and updating the Manifest used to be another (not necessarily 
> > preferred) method to address upstream remaking release packages.
> > 
> 
> It's no longer valid.

Just wondering.  Do you mean it isn't valid that some upstreams do this
(yes horror)?  We surely need a way to work around that ...

Thanks,
Fabian


-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] New distfile mirror layout

2019-10-28 Thread Michał Górny
On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
> Hi!
> 
> > Today you get chastised for using /space/distfiles-local and not
> > following policy changes.  The devmanual states that it's deprecated
> > since at least 2011, and talks of using d.g.o [1].
>  > [1] 
> https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
> 
> Sorry I'm late to the party, but I would like to enquire about what happens 
> if a file with existing filename but different b2sum gets uploaded to 
> /space/distfiles-local now?

The same as before: it gets put in the top-level distfiles directory.
Hashes are calculated from filenames, so a different b2sum wouldn't
affect placement.  That is, it wouldn't if files from distfiles-local
were put in subdirectories in the first place, which they aren't.
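
To make the filename point concrete, here is a minimal sketch.  It
assumes the subdirectory is the first two hex digits of the BLAKE2b
digest of the file name, which would give the 256 buckets mentioned
elsewhere in this thread.  `distfile_bucket` and `mirror_path` are names
made up for this example, not Portage or emirrordist functions.

```python
import hashlib


def distfile_bucket(filename: str, hex_digits: int = 2) -> str:
    """Return the bucket directory derived from the file *name* only."""
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def mirror_path(filename: str) -> str:
    """Relative path of a distfile under the (assumed) hashed layout."""
    return f"{distfile_bucket(filename)}/{filename}"


if __name__ == "__main__":
    # The file's contents never enter the calculation, so re-uploading a
    # changed tarball under the same name maps to the same path.
    print(mirror_path("foo-1.0.tar.gz"))
```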

> Doing so and updating the Manifest used to be another (not necessarily 
> preferred) method to address upstream remaking release packages.
> 

It's no longer valid.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-28 Thread Chí-Thanh Christopher Nguyễn

Hi!


> Today you get chastised for using /space/distfiles-local and not
> following policy changes.  The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].
> [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts


Sorry I'm late to the party, but I would like to enquire about what happens 
if a file with existing filename but different b2sum gets uploaded to 
/space/distfiles-local now?


Doing so and updating the Manifest used to be another (not necessarily 
preferred) method to address upstream remaking release packages.



Best regards,
Chí-Thanh Christopher Nguyễn



Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread Michał Górny
On Wed, 2019-10-23 at 17:04 -0500, William Hubbs wrote:
> On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> > On 10/21/2019 19:36, Matt Turner wrote:
> > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
> > > > Also, another idea is to use a cheap hash function (e.g. fletcher) and 
> > > > just have the mirrors do the hashing behind the scenes. Then we would 
> > > > have the best of both worlds.
> > > 
> > > It probably would have been better to make these suggestions when the
> > > GLEP was discussed close to two years ago.
> > > 
> > > I'm glad that we have ideas for improvements but I worry that we're
> > > just backseat driving at this point given that the GLEP's now
> > > implemented.
> > 
> > Agreed, although, I don't even remember this coming up two years ago.  But,
> > I was tied up with a lot of work-related stress and tasks, so probably just
> > my memory storage backend not having enough cycles to commit it 
> > to...neurons.
>  
>  After looking at this further, I found that the glep was presented to
>  us in Jan 2018 on the dev ml [1].
> 
> I checked all council meeting logs and discovered that this was never
> brought to us formally for approval.
> 
> It looks like the developers decided to do this as an
> infrastructure/portage project and because of that they felt like they
> didn't need a glep.
> 

...or simply forgot whether it was approved or not, after waiting
almost two years for the Portage team to provide a reference
implementation.

-- 
Best regards,
Michał Górny





Re: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread Richard Yao


> On Oct 23, 2019, at 7:48 PM, Richard Yao  wrote:
> 
> On 10/22/19 2:51 AM, Jaco Kroon wrote:
>> Hi All,
>> 
>> 
>>> On 2019/10/21 18:42, Richard Yao wrote:
>>> 
>>> If we consider the access frequency, it might actually not be that
>>> bad. Consider a simple example with 500 files and two directory
>>> buckets. If we have 250 in each, then the size of the directory is
>>> always 250. However, if 50 files are accessed 90% of the time, then
>>> putting 450 into one directory and that 50 into another directory, we
>>> end up with the performance of the O(n) directory lookup being
>>> consistent with there being only 90 files in each directory.
>>> 
>>> I am not sure if we should be discarding all other considerations to
>>> make changes to benefit O(n) directory lookup filesystems, but if we
>>> are, then the hashing approach is not necessarily the best one. It is
>>> only the best when all files are accessed with equal frequency, which
>>> would be an incorrect assumption. A more human friendly approach might
>>> still be better. I doubt that we have the data to determine that though.
>>> 
>>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>>> just have the mirrors do the hashing behind the scenes. Then we would
>>> have the best of both worlds.
>> 
>> 
>> Experience:
>> 
>> ext4 sucks at targeting name lookups without dir_index feature (O(n)
>> lookups - scans all entries in the folder).  With dir_index readdir
>> performance is crap.  Pick your poison I guess.  Most of our larger
>> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
>> disabling dir_index as the benefit is outweighed by the crappy readdir()
>> and glob() performance.
> My read of the ext4 disk layout documentation is that the read operation
> should work mostly the same way, except with a penalty from reading
> larger directories caused by the addition of the tree's metadata and
> from having more partially filled blocks:
> 
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
> 
> The code itself is the same traversal code:
> 
> https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106
> 
> However, a couple of things stand out to me here at a glance:
> 
> 1. `cond_resched()` adds scheduler delay for no apparent reason.
> `cond_resched()` is meant to be used in places where we could block
> excessively on non-PREEMPT kernels, but this doesn't strike me as one of
> those places. The fact that we block on disk on uncached reads naturally
> serves the same purpose, so an explicit rescheduling point here is
> redundant. PREEMPT kernels should perform better in readdir() on ext4 by
> virtue of making `cond_resched()` a no-op.
I just realized that the way that I worded this could be confusing, so please 
allow me to clarify what I meant. cond_resched() is meant for when a kernel 
thread will tie up a CPU for a long period of time. Blocking on disk will cause 
the CPU to be released to another thread. When we do not block on disk, this 
operation is quick. There is no good reason for putting cond_resched() here as 
far as I can tell.
> 2. read-ahead is implemented in a way that appears to be over-reading
> the directory whenever the needed information is not cached. This is
> technically read-ahead, although it is not a great way of doing it. A
> much better way to do this would be to pipeline `readdir()` by
> initiating asynchronous read operations in anticipation of future reads.
> 
> Both of these should affect both variants of ext4's directories, but the
> penalties I mentioned earlier mean that the dir_index variant would be
> affected more.
> 
> If you have a way to benchmark things, a simple idea to evaluate would
> be deleting the `cond_resched()` line. If we had data showing an
> improvement, I would be happy to send a small one-line patch deleting
> the line to Ted to get the change into mainline.
The more I think about this, the more absurd having cond_resched() here seems 
to me. I think I will sit on it for a few days. If it still seems absurd to me 
after sitting on it, I’ll send Ted a patch to delete that line with the remark 
that the use of cond_resched() is redundant with blocking on disk.
>> There doesn't seem to be a real specific tip-over point, and it seems to
>> depend a lot on RAM availability and harddrive speed (obviously).  So if
>> dentries gets cached, disk speeds becomes less of an issue.  However, on
>> large folders (where I typically use 10k as a value for large based on
>> "gut feeling" and "unquantifiable experience" and "nothing scientific at
>> all") I find that even with lots of RAM two consecutive ls commands
>> remains terribly slow. Switch off dir_index and that becomes an order of
>> magnitude faster.
>> 
>> I don't have a great deal of experience with XFS, but on those systems
>> where we do it's generally on a VM, and perceivably (again, not
>> scientific) our experience has been that it feels slower.  Again, not
>> scientific, 

ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread Richard Yao
On 10/22/19 2:51 AM, Jaco Kroon wrote:
> Hi All,
> 
> 
> On 2019/10/21 18:42, Richard Yao wrote:
>>
>> If we consider the access frequency, it might actually not be that
>> bad. Consider a simple example with 500 files and two directory
>> buckets. If we have 250 in each, then the size of the directory is
>> always 250. However, if 50 files are accessed 90% of the time, then
>> putting 450 into one directory and that 50 into another directory, we
>> end up with the performance of the O(n) directory lookup being
>> consistent with there being only 90 files in each directory.
>>
>> I am not sure if we should be discarding all other considerations to
>> make changes to benefit O(n) directory lookup filesystems, but if we
>> are, then the hashing approach is not necessarily the best one. It is
>> only the best when all files are accessed with equal frequency, which
>> would be an incorrect assumption. A more human friendly approach might
>> still be better. I doubt that we have the data to determine that though.
>>
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>> just have the mirrors do the hashing behind the scenes. Then we would
>> have the best of both worlds.
> 
> 
> Experience:
> 
> ext4 sucks at targeting name lookups without dir_index feature (O(n)
> lookups - scans all entries in the folder).  With dir_index readdir
> performance is crap.  Pick your poison I guess.  Most of our larger
> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
> disabling dir_index as the benefit is outweighed by the crappy readdir()
> and glob() performance.
My read of the ext4 disk layout documentation is that the read operation
should work mostly the same way, except with a penalty from reading
larger directories caused by the addition of the tree's metadata and
from having more partially filled blocks:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries

The code itself is the same traversal code:

https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106

However, a couple of things stand out to me here at a glance:

1. `cond_resched()` adds scheduler delay for no apparent reason.
`cond_resched()` is meant to be used in places where we could block
excessively on non-PREEMPT kernels, but this doesn't strike me as one of
those places. The fact that we block on disk on uncached reads naturally
serves the same purpose, so an explicit rescheduling point here is
redundant. PREEMPT kernels should perform better in readdir() on ext4 by
virtue of making `cond_resched()` a no-op.

2. read-ahead is implemented in a way that appears to be over-reading
the directory whenever the needed information is not cached. This is
technically read-ahead, although it is not a great way of doing it. A
much better way to do this would be to pipeline `readdir()` by
initiating asynchronous read operations in anticipation of future reads.

Both of these should affect both variants of ext4's directories, but the
penalties I mentioned earlier mean that the dir_index variant would be
affected more.

If you have a way to benchmark things, a simple idea to evaluate would
be deleting the `cond_resched()` line. If we had data showing an
improvement, I would be happy to send a small one-line patch deleting
the line to Ted to get the change into mainline.
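
As a starting point, a rough userspace sketch of such a benchmark: it
fills a scratch directory with empty files and times repeated full
listings.  The helper names are made up for this example, and it only
measures listing time as seen from userspace.  Comparing kernels with
and without the `cond_resched()` line still means booting each one and
running the same script on the same filesystem.

```python
import os
import tempfile
import time


def make_test_dir(n_files, parent=None):
    """Create a scratch directory holding n_files empty files."""
    path = tempfile.mkdtemp(dir=parent)
    for i in range(n_files):
        open(os.path.join(path, f"file-{i:06d}"), "w").close()
    return path


def time_listing(path, repeats=3):
    """Return (entry count, wall-clock seconds for each full listing)."""
    timings = []
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with os.scandir(path) as entries:
            count = sum(1 for entry in entries)
        timings.append(time.perf_counter() - start)
    return count, timings


if __name__ == "__main__":
    scratch = make_test_dir(10_000)
    count, timings = time_listing(scratch)
    print(count, ["%.4f s" % t for t in timings])
```

Runs after the first one mostly hit the cache.  For cold-cache numbers,
drop the caches between runs (as root, write 3 to
/proc/sys/vm/drop_caches) and pass a `parent` that lives on the ext4
filesystem under test.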

> There doesn't seem to be a real specific tip-over point, and it seems to
> depend a lot on RAM availability and harddrive speed (obviously).  So if
> dentries gets cached, disk speeds becomes less of an issue.  However, on
> large folders (where I typically use 10k as a value for large based on
> "gut feeling" and "unquantifiable experience" and "nothing scientific at
> all") I find that even with lots of RAM two consecutive ls commands
> remains terribly slow. Switch off dir_index and that becomes an order of
> magnitude faster.
> 
> I don't have a great deal of experience with XFS, but on those systems
> where we do it's generally on a VM, and perceivably (again, not
> scientific) our experience has been that it feels slower.  Again, not
> scientific, just perception.
> 
> I'm in support for the change.  This will bucket to 256 folders and
> should have a reasonably even split between folders.  If required a
> second layer could be introduced by using the 3rd and 4th digits of the
> hash for a second layer.  Any hash should be fine, it really doesn't
> need to be cryptographically strong, it just needs to provide a good
> spread and be really fast.  Generally a hash table should have a prime
> number of buckets to assist with hash bias, but frankly, that's over
> complicating the situation here.
> 
> I also agree with others that it used to be easy to get distfiles as and
> when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile". This perhaps
> just shifts the issue:
> 
> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
> "*-*" | wc -l
> 167
> 

Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread William Hubbs
On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> On 10/21/2019 19:36, Matt Turner wrote:
> > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
> >> Also, another idea is to use a cheap hash function (e.g. fletcher) and 
> >> just have the mirrors do the hashing behind the scenes. Then we would have 
> >> the best of both worlds.
> > 
> > It probably would have been better to make these suggestions when the
> > GLEP was discussed close to two years ago.
> > 
> > I'm glad that we have ideas for improvements but I worry that we're
> > just backseat driving at this point given that the GLEP's now
> > implemented.
> 
> Agreed, although, I don't even remember this coming up two years ago.  But,
> I was tied up with a lot of work-related stress and tasks, so probably just
> my memory storage backend not having enough cycles to commit it to...neurons.
 
After looking at this further, I found that the glep was presented to
us in Jan 2018 on the dev ml [1].

I checked all council meeting logs and discovered that this was never
brought to us formally for approval.

It looks like the developers decided to do this as an
infrastructure/portage project and because of that they felt like they
didn't need a glep.

Thanks,

William

[1] https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba

> IMHO, perhaps future GLEPs should have a defined window to implement them
> following discussion.  Having the discussion, then waiting a few years
> before implementing them leads to discussions like this where we're arguing
> about the color of the boat after the boat has sailed off into the distance.
> 
> -- 
> Joshua Kinard
> Gentoo/MIPS
> ku...@gentoo.org
> rsa6144/5C63F4E3F5C6C943 2015-04-27
> 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
> 
> "The past tempts us, the present confuses us, the future frightens us.  And
> our lives slip away, moment by moment, lost in that vast, terrible 
> in-between."
> 
> --Emperor Turhan, Centauri Republic
> 




Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread William Hubbs
On Wed, Oct 23, 2019 at 12:06:24PM -0500, William Hubbs wrote:
> On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> > On 10/21/2019 19:36, Matt Turner wrote:
> > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
> > >> Also, another idea is to use a cheap hash function (e.g. fletcher) and 
> > >> just have the mirrors do the hashing behind the scenes. Then we would 
> > >> have the best of both worlds.
> > > 
> > > It probably would have been better to make these suggestions when the
> > > GLEP was discussed close to two years ago.
> > > 
> > > I'm glad that we have ideas for improvements but I worry that we're
> > > just backseat driving at this point given that the GLEP's now
> > > implemented.
>  
>  Nothing is really etched in stone, so we could change it again if we
>  see fit.
 
Actually, which glep are we talking about? If we are talking about glep
75, I don't see where the council approved it [1], so it definitely
should be discussed/approved before any implementation changes are
made, or we should see where it was approved.

William

[1] https://www.gentoo.org/glep/glep-0075.html




Re: [gentoo-dev] New distfile mirror layout

2019-10-23 Thread William Hubbs
On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> On 10/21/2019 19:36, Matt Turner wrote:
> > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
> >> Also, another idea is to use a cheap hash function (e.g. fletcher) and 
> >> just have the mirrors do the hashing behind the scenes. Then we would have 
> >> the best of both worlds.
> > 
> > It probably would have been better to make these suggestions when the
> > GLEP was discussed close to two years ago.
> > 
> > I'm glad that we have ideas for improvements but I worry that we're
> > just backseat driving at this point given that the GLEP's now
> > implemented.
 
Nothing is really etched in stone, so we could change it again if we
see fit.

*snip*

> IMHO, perhaps future GLEPs should have a defined window to implement them
> following discussion.  Having the discussion, then waiting a few years
> before implementing them leads to discussions like this where we're arguing
> about the color of the boat after the boat has sailed off into the distance.

Agreed. I will work on a proposal for the next council meeting.

Thanks,

William





Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Joshua Kinard
On 10/21/2019 19:36, Matt Turner wrote:
> On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and just 
>> have the mirrors do the hashing behind the scenes. Then we would have the 
>> best of both worlds.
> 
> It probably would have been better to make these suggestions when the
> GLEP was discussed close to two years ago.
> 
> I'm glad that we have ideas for improvements but I worry that we're
> just backseat driving at this point given that the GLEP's now
> implemented.

Agreed, although, I don't even remember this coming up two years ago.  But,
I was tied up with a lot of work-related stress and tasks, so probably just
my memory storage backend not having enough cycles to commit it to...neurons.

IMHO, perhaps future GLEPs should have a defined window to implement them
following discussion.  Having the discussion, then waiting a few years
before implementing them leads to discussions like this where we're arguing
about the color of the boat after the boat has sailed off into the distance.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Joshua Kinard
On 10/21/2019 06:13, Kent Fredric wrote:
> On Sun, 20 Oct 2019 16:57:54 -0400
> Joshua Kinard  wrote:
> 
>> I know we've got a ton of Perl packages for the core set of Perl modules,
>> but doesn't the CPAN eclass also have the capability to auto-generate an
>> ebuild package for virtually any Perl package distributed via CPAN?  Can
>> that logic be used with the CTAN system in its own eclass and then we remove
>> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
>> might just have to generate ebuilds for texlive modules and treat them as
>> discrete, installed packages.
> 
> - Perl packages never have more than 1:1 source archives per ebuild
> - Perl upstream naming doesn't habitually use "perl-" as an archive prefix
> - Everything that is packaged for Perl in Gentoo is mirrored to the
>   Gentoo distfiles mirror, and this causes no issues.
> 
> So I don't think any comparison here makes sense.

I have to disagree on the "doesn't make sense" bit.  Regardless of what it
is that TexLive is packaging, the problem I see is that storing these
macro packages on our mirrors is responsible for 20% of *all* distfiles
that we store.  It's lopsided that a small collection of ebuilds, due to
the way their build logic is architected, has that many distfiles on the
mirrors.

It's scenarios like that which led to Michał developing the GLEP the way he
did.  His approach is more broad, seeking to future-proof the mirroring
issue regardless of package mirroring decisions, whereas I'm more curious
why texlive needs all of those packages on our mirrors when it appears to
have a fairly comprehensive mirroring system of its own.  Why reinvent the
wheel?

Since CTAN exists as a worldwide mirroring system, I think at a minimum, we
should try to fetch from that directly instead of mirroring them on our own
systems and partner mirrors.  Or we could go the other way and become an
official CTAN mirror ourselves.  After all, if we're going to reinvent the
wheel, do all four instead of just one.

And for Perl or Python, I think we should be making an effort to leverage
their respective mirroring systems first before putting their distfiles onto
our mirrors.  Perl's got CPAN, and Python has pypi.  For things that don't
exist on those systems, then we use our mirrors.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Rich Freeman
On Mon, Oct 21, 2019 at 12:42 PM Richard Yao  wrote:
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just 
> have the mirrors do the hashing behind the scenes. Then we would have the 
> best of both worlds.

I think something that is getting missed in this discussion is that we
don't control all of our mirrors, and they're generally donated
resources.  Somebody has some webserver, and they stick a Debian
mirror in one directory tree, and an Arch one in another, and they're
kind enough to give us one too.

That is why we're seeing odder situations like ntfs and so on being
mentioned.  They're not necessarily even running Linux, let alone zfs
or some other optimized filesystem.  And their webserver might be set
up to do browsable directory indexes which could perform terribly even
if the filesystem itself is fine with direct filename lookups.  It
doesn't matter if you have hashed b-trees or whatever for filename
lookups if you're going to ask the filesystem to give you a list of
every file in a large directory - it is going to have to traverse
whatever data structure it uses entirely to do so.

If we want to start putting requirements on hosting a mirror, then
we'll end up with fewer mirrors, and with mirrors more is usually
better.  Ideally a mirror should just be a black box to us - we don't
really care what they're running because we don't depend on any mirror
individually.  Likewise if we negatively impact mirror hosts we'll end
up with fewer mirrors.  Sure, maybe those hosts have odd
configurations, but we're still better off with them than without.
That said we do seem to have a lot of mirrors so it probably isn't the
end of the world if we lose a limited number.

And there is nothing to say that we can't have some infra mirror set
up more for interactive browsing that we don't have people fetch from
but which dispenses with all the hashing or which bins by the first
letter of the filename/etc.  It seems like most of the use cases where
hashing is inconvenient are for more casual use.
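
A quick sketch of how binning by filename prefix would behave, for anyone weighing it up for a browse-only mirror. The file names below are hypothetical, merely shaped like the texlive-module-* distfiles discussed elsewhere in this thread, and `prefix_bucket` is a made-up helper.

```python
from collections import Counter


def prefix_bucket(filename, length=1):
    """Bin a distfile by the first `length` characters of its name."""
    return filename[:length]


# Hypothetical names, shaped like the texlive-module-* distfiles.
names = [f"texlive-module-mod{i}-2019.tar.xz" for i in range(1000)]
names += ["bash-5.0.tar.gz", "zlib-1.2.11.tar.gz"]

for length in (1, 3, 7):
    bins = Counter(prefix_bucket(name, length) for name in names)
    prefix, size = bins.most_common(1)[0]
    print(f"prefix length {length}: {len(bins)} bins, "
          f"largest bin {prefix!r} holds {size} files")
```

Files sharing a long common prefix stay together however many letters are used, which is fine for finding things by hand but does nothing to cap directory size.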

To avoid another reply, people are talking about having utilities that
can fetch distfiles using the new scheme.  I'd think that "ebuild
foo.ebuild fetch" is probably the simplest solution for this.  Chances
are that you're dealing with SRC_URI strings that have variable
substitution in them anyway, so just letting ebuild do the fetching
means you're not substituting ${PV} and so on, let alone all the stuff
versionator and its ilk do.  And of course you can always just fetch
from upstream anyway if you do have a clean URI.

-- 
Rich



Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Jaco Kroon

Hi,

On 2019/10/22 10:43, Ulrich Mueller wrote:

>> On Tue, 22 Oct 2019, Jaco Kroon wrote:
>>
>> I also agree with others that it used to be easy to get distfiles as
>> and when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile".
>
> Not a good idea, because some distfiles are shared between packages.
> For example, sys-kernel/*-sources use the same distfiles. (It won't
> work with categories either, e.g., there are dev-lang/ruby and
> app-emacs/ruby-mode.)
>
> Ulrich


You are absolutely correct.  I then fully agree with the current implementation.

Kind Regards,
Jaco




Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Ulrich Mueller
> On Tue, 22 Oct 2019, Jaco Kroon wrote:

> I also agree with others that it used to be easy to get distfiles as
> and when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile".

Not a good idea, because some distfiles are shared between packages.
For example, sys-kernel/*-sources use the same distfiles. (It won't
work with categories either, e.g., there are dev-lang/ruby and
app-emacs/ruby-mode.)
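
To illustrate the sharing problem, here is a small sketch. The package-to-distfile mapping is illustrative only (not taken from the tree), and `cat_pkg_locations` is a made-up helper.

```python
from collections import defaultdict

# Illustrative mapping of packages to the distfiles they reference.
src_uri = {
    "sys-kernel/gentoo-sources": ["linux-5.3.tar.xz"],
    "sys-kernel/vanilla-sources": ["linux-5.3.tar.xz"],
}


def cat_pkg_locations(src_uri):
    """Map each distfile to every path it would need under cat/pkg/distfile."""
    locations = defaultdict(set)
    for package, distfiles in src_uri.items():
        for distfile in distfiles:
            locations[distfile].add(f"{package}/{distfile}")
    return locations


for distfile, paths in cat_pkg_locations(src_uri).items():
    print(distfile, "->", sorted(paths))
```

A shared distfile ends up with more than one candidate location, so a mirror would have to store it several times or pick one path arbitrarily.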

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-22 Thread Jaco Kroon

Hi All,


On 2019/10/21 18:42, Richard Yao wrote:

> If we consider the access frequency, it might actually not be that bad.
> Consider a simple example with 500 files and two directory buckets. If we
> have 250 in each, then the size of the directory is always 250. However,
> if 50 files are accessed 90% of the time, then putting 450 into one
> directory and that 50 into another directory, we end up with the
> performance of the O(n) directory lookup being consistent with there
> being only 90 files in each directory.
>
> I am not sure if we should be discarding all other considerations to make
> changes to benefit O(n) directory lookup filesystems, but if we are, then
> the hashing approach is not necessarily the best one. It is only the best
> when all files are accessed with equal frequency, which would be an
> incorrect assumption. A more human friendly approach might still be
> better. I doubt that we have the data to determine that though.
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and
> just have the mirrors do the hashing behind the scenes. Then we would have
> the best of both worlds.



Experience:

ext4 sucks at targeted name lookups without the dir_index feature (O(n)
lookups - scans all entries in the folder).  With dir_index, readdir
performance is crap.  Pick your poison I guess.  On most of our larger
filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
disabling dir_index as the benefit is outweighed by the crappy readdir()
and glob() performance.


There doesn't seem to be a real specific tip-over point, and it seems to
depend a lot on RAM availability and harddrive speed (obviously).  So if
dentries get cached, disk speed becomes less of an issue.  However, on
large folders (where I typically use 10k as a value for large based on
"gut feeling" and "unquantifiable experience" and "nothing scientific at
all") I find that even with lots of RAM two consecutive ls commands
remain terribly slow. Switch off dir_index and that becomes an order of
magnitude faster.


I don't have a great deal of experience with XFS, but on those systems 
where we do it's generally on a VM, and perceivably (again, not 
scientific) our experience has been that it feels slower.  Again, not 
scientific, just perception.


I'm in support of the change.  This will bucket to 256 folders and
should have a reasonably even split between folders.  If required a 
second layer could be introduced by using the 3rd and 4th digits of the 
hash for a second layer.  Any hash should be fine, it really doesn't 
need to be cryptographically strong, it just needs to provide a good 
spread and be really fast.  Generally a hash table should have a prime 
number of buckets to assist with hash bias, but frankly, that's over 
complicating the situation here.
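
As a rough sketch of the bucketing described above: the function below derives one or two directory layers from the leading hex digits of a hash of the file name. BLAKE2B is used only because the announcement's b2sum example uses it; as said, any fast hash with a good spread would do. The function name and the `levels` parameter are made up for illustration.

```python
import hashlib


def bucket_path(filename, levels=1):
    """Return the relative mirror path for `filename`.

    Each level uses two further hex digits of the filename's hash, so every
    directory level fans out into at most 256 subdirectories.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(parts + [filename])


print(bucket_path("foo-1.tar.gz"))            # one layer:  XX/foo-1.tar.gz
print(bucket_path("foo-1.tar.gz", levels=2))  # two layers: XX/YY/foo-1.tar.gz
```

With `levels=2` the 3rd and 4th hex digits form the second layer, giving up to 65536 leaf directories without changing the first layer.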


I also agree with others that it used to be easy to get distfiles as and 
when needed, so an alternative structure could mirror that of the 
portage tree itself, in other words "cat/pkg/distfile". This perhaps 
just shifts the issue:


jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
167
jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
19412
jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
347 net-misc
373 media-sound
395 media-libs
399 dev-util
505 dev-libs
528 dev-java
684 dev-haskell
690 dev-ruby
1601 dev-perl
1889 dev-python

So that's an average of 116 sub folders under the top layer (only two
over 1000), and then presumably less than 100 distfiles maximum per
package?  Probably overkill but would (should) solve both the too many
files per folder as well as the easy lookup by hand issue.


I don't have a preference on either solution though but do agree that
"easy finding of distfiles" is handy.  The INDEX mechanism is fine for me.


Kind Regards,

Jaco


Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread James Cloos
> "RY" == Richard Yao  writes:

RY> ext4 is probably okay, but don’t quote me on that.

Ext4 works fine here for a local distfiles mirror.

-JimC
-- 
James Cloos  OpenPGP: 0x997A9F17ED7DAEA6



Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread Matt Turner
On Mon, Oct 21, 2019 at 9:42 AM Richard Yao  wrote:
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just 
> have the mirrors do the hashing behind the scenes. Then we would have the 
> best of both worlds.

It probably would have been better to make these suggestions when the
GLEP was discussed close to two years ago.

I'm glad that we have ideas for improvements but I worry that we're
just backseat driving at this point given that the GLEP's now
implemented.



Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread Mikle Kolyada

On 21.10.2019 3:05, Joshua Kinard wrote:
> So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:
>
>   - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
>   - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
>   - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names
>
> Then, in texlive-module.eclass, there's these loops:
>
> for i in ${TEXLIVE_MODULE_CONTENTS}; do
> SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
> done
>
> # Forge doc SRC_URI
> [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
> for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
> SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
> done
> [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"
>
> # Forge source SRC_URI
> if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
> SRC_URI="${SRC_URI} source? ("
> for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
> SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
> done
> SRC_URI="${SRC_URI} )"
> fi
>
> I think this is definitely an issue with how this package is laying out its
> needed distfiles.  It really should be leveraging CTAN system at a minimum
> to fetch all of the needed distfiles so we can get them off of our
> distfiles mirror.  Then it would be interesting to re-run the math on
> the distfiles distribution using the different schemes highlighted in the
> GLEP-75 paper.

TexLive distributes collections of macros rather than separate packages,
and bases its packaging on CTAN. Meanwhile, CTAN packages are not
versioned: they only have an internal release number, with no tags,
releases and so on, see [1].

I also fail to see what problem you are trying to solve by suggesting
that we fetch macros from CTAN; the same amount of data would end up
being mirrored as a result.

[1] - https://ctan.org/tex-archive/systems/texlive/tlnet/archive





Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread Richard Yao



> On Oct 20, 2019, at 2:51 AM, Michał Górny  wrote:
> 
> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>> 
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>> 
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem. 
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>> 
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>> 
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>> 
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>> 
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>> 
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>> 
>>> 
>>> Alternatively, you can:
>>> 
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>> 
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>> 
>>> 
>>> If you're interested in more background details and some plots, see [1].
>>> 
>>> [1] 
>>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>> 
>> 
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>> 
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
> 
> Today you get chastised for using /space/distfiles-local and not
> following policy changes.  The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].
> 
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kind prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>> 
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to work
>> out the file path will be *really* annoying.
> 
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix.  So it just
> creates a lot of tiny directory noise for no practical gain.
> 
> [1] 
> https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

If we consider the access frequency, it might actually not be that bad. 
Consider a simple example with 500 files and two directory buckets. If we have 
250 in each, then the size of the directory is always 250. However, if 50 files 
are accessed 90% of the time, then putting 450 into one directory and that 50 
into another directory, we end up with the performance of the O(n) directory 
lookup being consistent with there being only 90 files in each directory.
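
As a quick check of that arithmetic, here is a small sketch. It assumes the cost of an O(n) lookup is simply proportional to the number of entries in the directory being searched, and the function name is made up for illustration.

```python
def expected_dir_size(buckets):
    """Average directory size seen per access.

    `buckets` is a list of (files_in_directory, percent_of_accesses) pairs.
    """
    return sum(size * percent for size, percent in buckets) / 100


balanced = expected_dir_size([(250, 50), (250, 50)])
skewed = expected_dir_size([(50, 90), (450, 10)])
print(balanced, skewed)  # 250.0 90.0
```

The skewed split gives 50 * 0.9 + 450 * 0.1 = 45 + 45 = 90, against 250 for the even split.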

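I am not sure if we should be discarding all other considerations to make 
changes to benefit O(n) directory lookup filesystems, but if we are, then the 
hashing approach is not necessarily the best one. It is only the best when all 
files are accessed with equal frequency, which would be an incorrect 
assumption. A more human friendly approach might still be better. I doubt that 
we have the data to determine that though.

Also, another idea is to use a cheap hash function (e.g. fletcher) and just 
have the mirrors do the hashing behind the scenes. Then we would have the best 
of both worlds.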

Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread Kent Fredric
On Sun, 20 Oct 2019 20:05:40 -0400
Joshua Kinard  wrote:

> Longer-term, I think this entire approach should be revisited by the TeX
> team to make it behave more like Perl or Python packages by having discrete
> ebuilds for these modules.  That's not exactly a small undertaking, but
> this current approach feels very kludgy in its design and is probably
> asking for trouble.  I looked at several of the modules on CTAN, and they
> each have their own version and even have different licenses.

With the current state of the portage dependency resolver, and with
regards to the constant problems end users face with it, I really can't
advise this unless you need to.

Currently working on vendoring rust in an overlay, and 128 ebuilds just
to satisfy the dependencies enough to test *one* package is a bit of a
piss-take.

I'd suggest waiting a few years for portage to see some improvements
here before taking on something that ambitious when the current
approach works well enough.




Re: [gentoo-dev] New distfile mirror layout

2019-10-21 Thread Kent Fredric
On Sun, 20 Oct 2019 16:57:54 -0400
Joshua Kinard  wrote:

> I know we've got a ton of Perl packages for the core set of Perl modules,
> but doesn't the CPAN eclass also have the capability to auto-generate an
> ebuild package for virtually any Perl package distributed via CPAN?  Can
> that logic be used with the CTAN system in its own eclass and then we remove
> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
> might just have to generate ebuilds for texlive modules and treat them as
> discrete, installed packages.

- Perl packages never have more than 1:1 source archives per ebuild
- Perl upstream naming doesn't habitually use "perl-" as an archive prefix
- Everything that is packaged for Perl in Gentoo is mirrored to the
  Gentoo distfiles mirror, and this causes no issues.

So I don't think any comparison here makes sense.




Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Ulrich Mueller
> On Mon, 21 Oct 2019, Joshua Kinard wrote:

>   - altfont is licensed under "GNU General Public License" (version ??)
>   - achemso is licensed under "The LaTeX Project Public License 1.3c"
>   - arraysort is licensed under "The LaTeX Project Public License 1.2"
>   - amsfonts is licensed under "The SIL Open Font License"
>   - a0poster is licensed under "The LaTeX Project Public License" (ver ??)
>   - arydshln is licensed under "The LaTeX Project Public License 1"
>   - aurl is licensed under "Public Domain Software"

> That's just a random selection from the 'a' category. Do we have
> copies of those licenses in the tree?

Yes.

> Do they allow redistribution of the distfiles?

Yes.

> For the users that want "free" software, do any of the licenses in any
> of the TeX modules put up any disagreeable restrictions?

All of TeXLive should be free software. Upstream doesn't accept anything
that is non-free. (Mistakes can happen, though. There was one non-free
module in texlive-latexextra-2019, which was sorted out in bug 687328.)

Ulrich




Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Joshua Kinard
On 10/20/2019 16:57, Joshua Kinard wrote:
> On 10/20/2019 05:44, Michal Górny wrote:
>> On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:
>>> On 10/20/2019 04:32, Michal Górny wrote:
[snip]
>> You believe it to be a problem.  Don't expect others to bother upstream
>> with your preferences.
> 
> Hah.  So you consider texlive having 16k+ distfiles to be completely within
> operating norms then?
> 
> I did a quick look, and it looks like the TeX project has a fairly
> comprehensive mirroring system distributed around the world.  In fact, it
> looks like they emulate Perl's CPAN system with "CTAN":
> 
> https://ctan.org/
> 
> I don't know the history of the texlive and other associated tex packages in
> Gentoo, but my guess is instead of doing what our Perl packages do, someone
> just decided to mirror the CTAN archive directly on the Gentoo distfiles
> system.  It seems to me that what should actually happen is that we leverage
> CTAN itself, much like CPAN, and use their mirroring system instead of
> burdening our infrastructure as an unofficial CTAN archive.
> 
> I know we've got a ton of Perl packages for the core set of Perl modules,
> but doesn't the CPAN eclass also have the capability to auto-generate an
> ebuild package for virtually any Perl package distributed via CPAN?  Can
> that logic be used with the CTAN system in its own eclass and then we remove
> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
> might just have to generate ebuilds for texlive modules and treat them as
> discrete, installed packages.

So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:

  - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
  - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
  - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names

Then, in texlive-module.eclass, there's these loops:

for i in ${TEXLIVE_MODULE_CONTENTS}; do
    SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
done

# Forge doc SRC_URI
[ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
    SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
done
[ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"

# Forge source SRC_URI
if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
    SRC_URI="${SRC_URI} source? ("
    for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
    done
    SRC_URI="${SRC_URI} )"
fi

I think this is definitely an issue with how this package is laying out its
needed distfiles.  It really should be leveraging CTAN system at a minimum
to fetch all of the needed distfiles so we can get them off of our
distfiles mirror.  Then it would be interesting to re-run the math on
the distfiles distribution using the different schemes highlighted in the
GLEP-75 paper.
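
To put a number on that: the three lists add up to 1,241 + 1,227 + 745 =
3,213 distfiles for this one ebuild.  A minimal sketch of the same loop
logic follows.  The three tiny sample lists, the PV/PKGEXT values, and the
forge_src_uri/count_src_uri helper names are made up for illustration and
are not in the eclass; it only shows how each name becomes one
mirror://gentoo entry:

```shell
#!/bin/sh
# Sketch only: mimics the SRC_URI loops quoted above with tiny sample lists.
# PV, PKGEXT and the module names below are illustrative values.
PV="2019"
PKGEXT="tar.xz"

# forge_src_uri LIST: print one mirror:// URI per space-delimited name.
forge_src_uri() {
    for i in $1; do
        printf 'mirror://gentoo/texlive-module-%s-%s.%s\n' "$i" "$PV" "$PKGEXT"
    done
}

# count_src_uri LIST...: count how many distfiles the given lists produce.
count_src_uri() {
    total=0
    for list in "$@"; do
        for i in $list; do
            total=$((total + 1))
        done
    done
    echo "$total"
}

MODULES="achemso altfont arydshln"
DOCS="achemso.doc altfont.doc"
SRCS="achemso.source"

forge_src_uri "$MODULES"
count_src_uri "$MODULES" "$DOCS" "$SRCS"   # prints 6

# Totals for the real ebuild, using the counts listed above:
echo $((1241 + 1227 + 745))                # prints 3213
```

Every one of those entries is a separate file that has to live somewhere on
the mirrors, which is why this one package skews the distribution so badly.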

Longer-term, I think this entire approach should be revisited by the TeX
team to make it behave more like Perl or Python packages by having discrete
ebuilds for these modules.  That's not exactly a small undertaking, but
this current approach feels very kludgy in its design and is probably
asking for trouble.  I looked at several of the modules on CTAN, and they
each have their own version and even have different licenses.

E.g.,

  - altfont is licensed under "GNU General Public License" (version ??)
  - achemso is licensed under "The LaTeX Project Public License 1.3c"
  - arraysort is licensed under "The LaTeX Project Public License 1.2"
  - amsfonts is licensed under "The SIL Open Font License"
  - a0poster is licensed under "The LaTeX Project Public License" (ver ??)
  - arydshln is licensed under "The LaTeX Project Public License 1"
  - aurl is licensed under "Public Domain Software"

That's just a random selection from the 'a' category.  Do we have copies
of those licenses in the tree?  Do they allow redistribution of the
distfiles?  For the users that want "free" software, do any of the licenses
in any of the TeX modules put up any disagreeable restrictions?

Etc...

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Joshua Kinard
On 10/20/2019 05:44, Michał Górny wrote:
> On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:
>> On 10/20/2019 04:32, Michał Górny wrote:
>>> On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
>>>> Why is having a max ~24k files in a directory a bad idea?  Modern
>>>> filesystems are more than capable of handling that.
>>>>
>>>>   - ext4: unlimited files in a directory
>>>>   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
>>>>   - ntfs: 4,294,967,295
>>>>
>>>> And 24k is a bit more than 1/3rd of all distfiles that we currently have.
>>>
>>> For the same reason having ~60k files in a directory was a problem. 
>>> There is really no point in changing anything if you change BIG_NUMBER
>>> to SMALLER_BIG_NUMBER.
>>
>> That doesn't answer my question.  Why is it a problem?  What criteria are
>> you using to decide that 24k is a "smaller big number"?  Is there some issue
>> highlighted by the mirror admins where having 24k files in a single
>> directory offers no significant relief versus the current 60k files?
> 
> IIRC Robin set the goal as:
> 
> | the number of files in a single directory should not exceed 1000, [1]
> 
> I don't recall how that number was chosen but it's probably pretty
> arbitrary.  In any case, I can notice the difference between working
> with a listing of 1k files and 24k files, on the hardware running
> masterdist.

I think it would be prudent then to get some data to help underpin why that
number was chosen and add that to the GLEP, possibly as one of the
references at the bottom.  Your personal observations of a system
(masterdist) that few of us have access to are not good enough, especially
for future developers who may revisit this topic long after you or I are gone.
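
As a starting point for gathering that data, something like the following
could be run on any test box.  It is only a sketch: the populate_dir and
count_entries helpers and the file counts are my own placeholders, and
numbers from a scratch directory will not necessarily match what masterdist
or a mirror sees.

```shell
#!/bin/sh
# Sketch: fill a scratch directory with N empty files so that listing
# times for different directory sizes can be compared.

# populate_dir DIR N: create N empty files named distfile-1 .. distfile-N.
populate_dir() {
    dir=$1
    n=$2
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$n" ]; do
        : > "$dir/distfile-$i"
        i=$((i + 1))
    done
}

# count_entries DIR: print the number of (non-hidden) entries in DIR.
count_entries() {
    ls "$1" | wc -l | tr -d ' '
}

scratch=$(mktemp -d)
populate_dir "$scratch/small" 1000
count_entries "$scratch/small"   # prints 1000
rm -rf "$scratch"
```

With a second directory populated at 24000 entries, running "time ls" (or an
rsync dry run) against each one would give numbers that could be cited in
the GLEP instead of an impression.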


> 
>>>> Under which scenario do you wind up with 24k files in a single directory?
>>>> I consider the tex package an outlier in this case (one package should not
>>>> be the sole dictator of policy).
>>>
>>> Three versions of TeXLive living simultaneously.  If one package falls
>>> completely out of bounds, no problem is solved by the change, so what's
>>> the point of making it?
>>
>> The problem in this case is with texlive, not our current, or future,
>> distfiles methodology.
> 
> Is it?  Are you suggesting we should ban upstream from using multiple
> distfiles with similar prefix?  What about other potential packages that
> may suffer from the same problem in the future?  Go packages have a good
> potential, given that the majority of them start with 'github.com'.

Please highlight which of my words imply in any way that I want to ban
something.  I simply said texlive's significant number of distfiles is a
problem.  That doesn't mean that I want to resolve the problem by banning
it, or future packages that employ that method.

My concern is that out of the tens of thousands of packages we have, we're
allowing ONE package to dictate how we shape a major piece of Gentoo
infrastructure, and I don't feel that the proposed solution seeks to address
it.  Rather, it seeks to band-aid it by wrapping the entire distro up like a
mummy.


>> Has anyone looked at how other distros deal with texlive?
> 
> Other distros don't mirror original distfiles.

Has thought been given to doing the same?  This is arguably a better approach
than mirroring original distfiles in devspace.  This would significantly
reduce the infrastructure burden on the project.


>>   Has anyone complained or filed a bug to texlive developers
>> upstream about their excessive amount of distfiles and the burden it places
>> on distro maintainers?
> 
> You believe it to be a problem.  Don't expect others to bother upstream
> with your preferences.

Hah.  So you consider texlive having 16k+ distfiles to be completely within
operating norms then?

I did a quick look, and it looks like the TeX project has a fairly
comprehensive mirroring system distributed around the world.  In fact, it
looks like they emulate Perl's CPAN system with "CTAN":

https://ctan.org/

I don't know the history of the texlive and other associated tex packages in
Gentoo, but my guess is instead of doing what our Perl packages do, someone
just decided to mirror the CTAN archive directly on the Gentoo distfiles
system.  It seems to me that what should actually happen is that we leverage
CTAN itself, much like CPAN, and use their mirroring system instead of
burdening our infrastructure as an unofficial CTAN archive.

I know we've got a ton of Perl packages for the core set of Perl modules,
but doesn't the CPAN eclass also have the capability to auto-generate an
ebuild package for virtually any Perl package distributed via CPAN?  Can
that logic be used with the CTAN system in its own eclass and then we remove
the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
might just have to generate ebuilds for texlive modules and treat them as
discrete, installed packages.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org

Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Matt Turner
On Sun, Oct 20, 2019 at 1:25 AM Joshua Kinard  wrote:
> In any event, I still think using devspace is a bad idea.  A centralized
> distfiles repo is what most other distros use, and it's what we should use.

I agree, but let's discuss that in a separate topic.



Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Michał Górny
On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:
> On 10/20/2019 04:32, Michał Górny wrote:
> > On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
> > > Why is having a max ~24k files in a directory a bad idea?  Modern
> > > filesystems are more than capable of handling that.
> > > 
> > >   - ext4: unlimited files in a directory
> > >   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
> > >   - ntfs: 4,294,967,295
> > > 
> > > And 24k is a bit more than 1/3rd of all distfiles that we currently have.
> > 
> > For the same reason having ~60k files in a directory was a problem. 
> > There is really no point in changing anything if you change BIG_NUMBER
> > to SMALLER_BIG_NUMBER.
> 
> That doesn't answer my question.  Why is it a problem?  What criteria are
> you using to decide that 24k is a "smaller big number"?  Is there some issue
> highlighted by the mirror admins where having 24k files in a single
> directory offers no significant relief versus the current 60k files?

IIRC Robin set the goal as:

| the number of files in a single directory should not exceed 1000, [1]

I don't recall how that number was chosen but it's probably pretty
arbitrary.  In any case, I can notice the difference between working
with a listing of 1k files and 24k files, on the hardware running
masterdist.

> > > Under which scenario do you wind up with 24k files in a single directory?
> > > I consider the tex package an outlier in this case (one package should not
> > > be the sole dictator of policy).
> > 
> > Three versions of TeXLive living simultaneously.  If one package falls
> > completely out of bounds, no problem is solved by the change, so what's
> > the point of making it?
> 
> The problem in this case is with texlive, not our current, or future,
> distfiles methodology.

Is it?  Are you suggesting we should ban upstream from using multiple
distfiles with similar prefix?  What about other potential packages that
may suffer from the same problem in the future?  Go packages have a good
potential, given that the majority of them start with 'github.com'.

> Has anyone looked at how other distros deal with texlive?

Other distros don't mirror original distfiles.

>   Has anyone complained or filed a bug to texlive developers
> upstream about their excessive amount of distfiles and the burden it places
> on distro maintainers?

You believe it to be a problem.  Don't expect others to bother upstream
with your preferences.


[1] https://www.gentoo.org/glep/glep-0075.html#algorithm-for-splitting-distfiles

-- 
Best regards,
Michał Górny



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Joshua Kinard
On 10/20/2019 04:32, Michał Górny wrote:
> On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
>> On 10/20/2019 02:51, Michał Górny wrote:
>>> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>>> On 10/18/2019 09:41, Michał Górny wrote:
>>>>> Hi, everybody.
>>>>>
>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>> already -- as their caches expire (24hrs).
>>>>>
>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>> having a 60000+ files in a single directory have been a problem.
>>>>> However, I suppose some of you also found e.g. the directory index
>>>>> hardly usable due to its size.
>>>>>
>>>>> Throughout a transitional period (whose exact length hasn't been decided
>>>>> yet), both layouts will be available.  Afterwards, the old layout will
>>>>> be removed from mirrors.  This has a few implications:
>>>>>
>>>>> 1. Users who don't upgrade their package managers in time will lose
>>>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>>>> much of a problem given that the core software needed to upgrade Portage
>>>>> should all have reliable upstream SRC_URIs.
>>>>>
>>>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>>>> discarding its usage and moving distfiles to devspace.
>>>>>
>>>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>>>> to use something like:
>>>>>
>>>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>>>> 1b
>>>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>>>> ...
>>>>>
>>>>>
>>>>> Alternatively, you can:
>>>>>
>>>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>>>
>>>>> and grep for the right path there.  This INDEX is also a more
>>>>> lightweight alternative to HTML indexes generated by the servers.
>>>>>
>>>>>
>>>>> If you're interested in more background details and some plots, see [1].
>>>>>
>>>>> [1]
>>>>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>>>
>>>>
>>>> So the answer I didn't really see directly stated here is, where do new
>>>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>>>> distfile to /space/distfiles-local.  What is the new directory I need to
>>>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>>>> target, what would be the applicable prefix to use?
>>>>
>>>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>>>> got chastised for doing exactly that.  Too much possibility of fragmentation
>>>> as devs retire or package maintainership changes hands.
>>>
>>> Today you get chastised for using /space/distfiles-local and not
>>> following policy changes.  The devmanual states that it's deprecated
>>> since at least 2011, and talks of using d.g.o [1].
>>
>> I don't recall this change being added as far back as 2011.  Maybe my memory
>> is bad, but if it was done that long ago, it was done quietly, and it was
>> not enforced.  I checked my local mailing list archives for gentoo-dev and
>> don't see any mention of distfiles-local being deprecated back then.  Why
>> has it taken 8 years for this to get addressed?
> 
> Don't ask me.  I think I was already taught to use d.g.o back when I was
> recruited.
> 
>> In any event, I still think using devspace is a bad idea.  A centralized
>> distfiles repo is what most other distros use, and it's what we should use.
> 
> Talking doesn't make things happen.  Coming up with good proposals that
> address all the problems (e.g. those listed in devmanual) does.

Proposing changes when a direction has already been decided, the rudder
position changed, and engines put to full power is equally pointless.
You're the de facto captain of this ship lately.  I expect you to not rock
the boat too hard.  This change is a pretty hard jolt, IMHO.


>>>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>>>> hash-based naming scheme on the new distfiles layout.  I really kinda prefer
>>>> breaking the directories up based on the first letter of the distfiles in
>>>> question, factoring case-sensitivity in (so you'd have 52 top-level
>>>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>>>> directories, additional subdirectories for the next few letters (say,
>>>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>>>> live on its own, but from a direct navigation standpoint, it's easy to find
>>>> for someone browsing the distfiles server and easy to predict where a
>>>> distfile is at.
>>>>
>>>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>>>> behind that

Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Michał Górny
On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
> On 10/20/2019 02:51, Michał Górny wrote:
> > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
> > > On 10/18/2019 09:41, Michał Górny wrote:
> > > > Hi, everybody.
> > > > 
> > > > It is my pleasure to announce that yesterday (EU) evening we've switched
> > > > to a new distfile mirror layout.  Users will be switching to the new
> > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > > > already -- as their caches expire (24hrs).
> > > > 
> > > > The new layout is mostly a bow towards mirror admins, for some of whom
> > > > > having a 60000+ files in a single directory have been a problem.
> > > > However, I suppose some of you also found e.g. the directory index
> > > > hardly usable due to its size.
> > > > 
> > > > Throughout a transitional period (whose exact length hasn't been decided
> > > > yet), both layouts will be available.  Afterwards, the old layout will
> > > > be removed from mirrors.  This has a few implications:
> > > > 
> > > > 1. Users who don't upgrade their package managers in time will lose
> > > > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > > > much of a problem given that the core software needed to upgrade Portage
> > > > should all have reliable upstream SRC_URIs.
> > > > 
> > > > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > > > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > > > discarding its usage and moving distfiles to devspace.
> > > > 
> > > > 3. Directly fetching files from distfiles.gentoo.org will become
> > > > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > > > to use something like:
> > > > 
> > > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > > > 1b
> > > > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > > > ...
> > > > 
> > > > 
> > > > Alternatively, you can:
> > > > 
> > > > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> > > > 
> > > > and grep for the right path there.  This INDEX is also a more
> > > > lightweight alternative to HTML indexes generated by the servers.
> > > > 
> > > > 
> > > > If you're interested in more background details and some plots, see [1].
> > > > 
> > > > [1] 
> > > > https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> > > > 
> > > 
> > > So the answer I didn't really see directly stated here is, where do new
> > > distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> > > distfile to /space/distfiles-local.  What is the new directory I need to
> > > use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> > > target, what would be the applicable prefix to use?
> > > 
> > > > Directly using devspace seems like a bad idea, IMHO.  Once long ago, we
> > > > all got chastised for doing exactly that.  Too much possibility of
> > > > fragmentation as devs retire or package maintainership changes hands.
> > 
> > Today you get chastised for using /space/distfiles-local and not
> > following policy changes.  The devmanual states that it's deprecated
> > since at least 2011, and talks of using d.g.o [1].
> 
> I don't recall this change being added as far back as 2011.  Maybe my memory
> is bad, but if it was done that long ago, it was done quietly, and it was
> not enforced.  I checked my local mailing list archives for gentoo-dev and
> don't see any mention of distfiles-local being deprecated back then.  Why
> has it taken 8 years for this to get addressed?

Don't ask me.  I think I was already taught to use d.g.o back when I was
recruited.

> In any event, I still think using devspace is a bad idea.  A centralized
> distfiles repo is what most other distros use, and it's what we should use.

Talking doesn't make things happen.  Coming up with good proposals that
address all the problems (e.g. those listed in devmanual) does.

> > > > I looked at the whitepaper'ish-like writeup, and I kinda don't like using
> > > > a hash-based naming scheme on the new distfiles layout.  I really kinda
> > > > prefer breaking the directories up based on the first letter of the
> > > > distfiles in question, factoring case-sensitivity in (so you'd have 52
> > > > top-level directories for A-Z and a-z, plus 10 more for 0-9).  Under each
> > > > of those directories, additional subdirectories for the next few letters
> > > > (say, letters 2-3).  Yes, this leads to some orphan cases where a distfile
> > > > might live on its own, but from a direct navigation standpoint, it's easy
> > > > to find for someone browsing the distfiles server and easy to predict
> > > > where a distfile is at.
> > > > 
> > > > No math, statistical analysis, or deep-rooted knowledge of filesystems
> > > > behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> > > > need to go get a distfile off the Gentoo mirrors, and being able to
> > > > quickly find it in the mirror root 

Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Joshua Kinard
On 10/20/2019 02:51, Michał Górny wrote:
> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 60000+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1] 
>>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
> 
> Today you get chastised for using /space/distfiles-local and not
> following policy changes.  The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].

I don't recall this change being added as far back as 2011.  Maybe my memory
is bad, but if it was done that long ago, it was done quietly, and it was
not enforced.  I checked my local mailing list archives for gentoo-dev and
don't see any mention of distfiles-local being deprecated back then.  Why
has it taken 8 years for this to get addressed?

In any event, I still think using devspace is a bad idea.  A centralized
distfiles repo is what most other distros use, and it's what we should use.


>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kinda prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to work
>> out the file path will be *really* annoying.
> 
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix.  So it just
> creates a lot of tiny directory noise for no practical gain.

Why is having a max ~24k files in a directory a bad idea?  Modern
filesystems are more than capable of handling that.

  - ext4: unlimited files in a directory
  - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
  - ntfs: 4,294,967,295

And 24k is a bit more than 1/3rd of all distfiles that we 

Re: [gentoo-dev] New distfile mirror layout

2019-10-20 Thread Michał Górny
On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
> On 10/18/2019 09:41, Michał Górny wrote:
> > Hi, everybody.
> > 
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> > 
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having a 60000+ files in a single directory have been a problem.
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> > 
> > Throughout a transitional period (whose exact length hasn't been decided
> > yet), both layouts will be available.  Afterwards, the old layout will
> > be removed from mirrors.  This has a few implications:
> > 
> > 1. Users who don't upgrade their package managers in time will lose
> > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > much of a problem given that the core software needed to upgrade Portage
> > should all have reliable upstream SRC_URIs.
> > 
> > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > discarding its usage and moving distfiles to devspace.
> > 
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> > 
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> > 
> > 
> > Alternatively, you can:
> > 
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> > 
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> > 
> > 
> > If you're interested in more background details and some plots, see [1].
> > 
> > [1] 
> > https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> > 
> 
> So the answer I didn't really see directly stated here is, where do new
> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> distfile to /space/distfiles-local.  What is the new directory I need to
> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> target, what would be the applicable prefix to use?
> 
> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> got chastised for doing exactly that.  Too much possibility of fragmentation
> as devs retire or package maintainership changes hands.

Today you get chastised for using /space/distfiles-local and not
following policy changes.  The devmanual states that it's deprecated
since at least 2011, and talks of using d.g.o [1].

> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> hash-based naming scheme on the new distfiles layout.  I really kinda prefer
> breaking the directories up based on the first letter of the distfiles in
> question, factoring case-sensitivity in (so you'd have 52 top-level
> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> directories, additional subdirectories for the next few letters (say,
> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> live on its own, but from a direct navigation standpoint, it's easy to find
> for someone browsing the distfiles server and easy to predict where a
> distfile is at.
> 
> No math, statistical analysis, or deep-rooted knowledge of filesystems
> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great.  Having to do hash calculations to work
> out the file path will be *really* annoying.

Your solution still doesn't solve the problem of having 8k-24k files
in a single directory, even if you use 7 letters of prefix.  So it just
creates a lot of tiny directory noise for no practical gain.
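
To illustrate with a sketch (the prefix_bucket helper and the sample
filenames are just examples, not anything Portage ships): every TeXLive
distfile begins with the same 15 characters, "texlive-module-", so cutting
the first 7 gives the same directory for all of them.

```shell
#!/bin/sh
# prefix_bucket NAME N: print the first N characters of NAME, i.e. the
# directory a filename-prefix layout would put the file in.
prefix_bucket() {
    printf '%s\n' "$1" | cut -c1-"$2"
}

# Three sample TeXLive distfile names; all of them land in "texlive/".
for f in texlive-module-achemso-2019.tar.xz \
         texlive-module-altfont-2019.tar.xz \
         texlive-module-arydshln-2019.tar.xz; do
    printf '%s -> %s/\n' "$f" "$(prefix_bucket "$f" 7)"
done
```

A hash of the filename does not have this weakness, because names that share
a prefix still hash to unrelated values.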

[1] 
https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Joshua Kinard
On 10/19/2019 19:57, Alec Warner wrote:
> On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard  wrote:
> 
>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 60000+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1]
>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
> 
> 
> 
> 
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of
>> fragmentation as devs retire or package maintainership changes hands.
>>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kinda prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to work
>> out the file path will be *really* annoying.
>>
> 
> So if you want a tool that "downloads a distfile off of the mirrors" we
> should be able to build such a utility.
> 
> I'm not really sure why that tool needs to be:
> *copy DISTFILENAME*
> wget distfiles.gentoo.org/$PASTE
> 
> It could just be `ebuild portageq download $DISTFILENAME` or similar.
> 
> -A

Sometimes, I'm not on a Gentoo system, or even a Linux/Unix platform, when I
go to fetch a distfile.  I could fetch (and have fetched) them off of Debian's
mirrors before, but Gentoo is what I know, and fetching a distfile off of
those mirrors manually was generally very straightforward.

Not a common case, and certainly not a blocker.  I was just pointing out
that hash-based naming is decidedly a lot less human-friendly.  But,
that's been the general trend for all things technology these last few years.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Alec Warner
On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard  wrote:

> On 10/18/2019 09:41, Michał Górny wrote:
> > Hi, everybody.
> >
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> >
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having a 6+ files in a single directory have been a problem.
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> >
> > Throughout a transitional period (whose exact length hasn't been decided
> > yet), both layouts will be available.  Afterwards, the old layout will
> > be removed from mirrors.  This has a few implications:
> >
> > 1. Users who don't upgrade their package managers in time will lose
> > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > much of a problem given that the core software needed to upgrade Portage
> > should all have reliable upstream SRC_URIs.
> >
> > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > discarding its usage and moving distfiles to devspace.
> >
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> >
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> >
> >
> > Alternatively, you can:
> >
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> >
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> >
> >
> > If you're interested in more background details and some plots, see [1].
> >
> > [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> >
>
> So the answer I didn't really see directly stated here is, where do new
> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> distfile to /space/distfiles-local.  What is the new directory I need to
> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> target, what would be the applicable prefix to use?
>




>
> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> got chastised for doing exactly that.  Too much possibility of fragmentation
> as devs retire or package maintainership changes hands.
>
> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
> breaking the directories up based on the first letter of the distfiles in
> question, factoring case-sensitivity in (so you'd have 52 top-level
> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> directories, additional subdirectories for the next few letters (say,
> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> live on its own, but from a direct navigation standpoint, it's easy to find
> for someone browsing the distfiles server and easy to predict where a
> distfile is at.
>
> No math, statistical analysis, or deep-rooted knowledge of filesystems
> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great.  Having to do hash calculations to work
> out the file path will be *really* annoying.
>

So if you want a tool that "downloads a distfile off of the mirrors" we
should be able to build such a utility.

I'm not really sure why that tool needs to be:
*copy DISTFILENAME*
wget distfiles.gentoo.org/$PASTE

It could just be `ebuild portageq download $DISTFILENAME` or similar.
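
A minimal sketch of what such a helper might compute, assuming the layout
described in the announcement (the first two hex digits of the BLAKE2b hash
of the filename, mirroring `b2sum | cut -c1-2`).  The function name
`distfile_url` and its `base` argument are illustrative only; this is not an
existing Portage or portageq interface.

```python
import hashlib


def distfile_url(name, base="http://distfiles.gentoo.org/distfiles"):
    """Return the assumed new-layout URL for a distfile name (hypothetical helper)."""
    # b2sum defaults to BLAKE2b with a 512-bit digest, which matches the
    # default digest size of hashlib.blake2b.
    bucket = hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]
    return f"{base}/{bucket}/{name}"


print(distfile_url("foo-1.tar.gz"))
```

Since this only needs a BLAKE2b implementation, the same calculation could
be done on a non-Gentoo system as well.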

-A






>
> --
> Joshua Kinard
> Gentoo/MIPS
> ku...@gentoo.org
> rsa6144/5C63F4E3F5C6C943 2015-04-27
> 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
>
> "The past tempts us, the present confuses us, the future frightens us.  And
> our lives slip away, moment by moment, lost in that vast, terrible
> in-between."
>
> --Emperor Turhan, Centauri Republic
>
>


Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Joshua Kinard
On 10/18/2019 09:41, Michał Górny wrote:
> Hi, everybody.
> 
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> 
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem. 
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
> 
> Throughout a transitional period (whose exact length hasn't been decided
> yet), both layouts will be available.  Afterwards, the old layout will
> be removed from mirrors.  This has a few implications:
> 
> 1. Users who don't upgrade their package managers in time will lose
> the ability of fetching from Gentoo mirrors.  This shouldn't be that
> much of a problem given that the core software needed to upgrade Portage
> should all have reliable upstream SRC_URIs.
> 
> 2. mirror://gentoo/file URIs will stop working.  While technically you
> could use mirror://gentoo/XX/file, I'd rather recommend finally
> discarding its usage and moving distfiles to devspace.
> 
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
> 
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
> 
> 
> Alternatively, you can:
> 
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
> 
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
> 
> 
> If you're interested in more background details and some plots, see [1].
> 
> [1] 
> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> 

So the answer I didn't really see directly stated here is, where do new
distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
distfile to /space/distfiles-local.  What is the new directory I need to
use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
target, what would be the applicable prefix to use?

Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
got chastised for doing exactly that.  Too much possibility of fragmentation
as devs retire or package maintainership changes hands.

I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
hash-based naming scheme on the new distfiles layout.  I really kind of prefer
breaking the directories up based on the first letter of the distfiles in
question, factoring case-sensitivity in (so you'd have 52 top-level
directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
directories, additional subdirectories for the next few letters (say,
letters 2-3).  Yes, this leads to some orphan cases where a distfile might
live on its own, but from a direct navigation standpoint, it's easy to find
for someone browsing the distfiles server and easy to predict where a
distfile is at.
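
To make the idea concrete, here is a rough sketch of one possible reading of
that layout: a top-level directory named after the first character
(case-sensitive), then a single subdirectory named after characters 2-3.
The function name `prefix_path` and the handling of very short names are
assumptions made for illustration, not part of any agreed design.

```python
def prefix_path(name):
    """Map a distfile name to a prefix-based path (hypothetical layout)."""
    top = name[0]    # first character, case preserved
    sub = name[1:3]  # characters 2-3, may be shorter for tiny names
    # A one-character name has no characters 2-3; keeping it directly
    # under the top-level directory is an assumption of this sketch.
    if not sub:
        return f"{top}/{name}"
    return f"{top}/{sub}/{name}"


for example in ("foo-1.tar.gz", "Python-3.7.4.tar.xz", "7z"):
    print(prefix_path(example))
```

The path can be predicted by eye from the filename alone, which is the
navigation benefit argued for above.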

No math, statistical analysis, or deep-rooted knowledge of filesystems
behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
need to go get a distfile off the Gentoo mirrors, and being able to quickly
find it in the mirror root is great.  Having to do hash calculations to work
out the file path will be *really* annoying.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 19, 2019, at 4:03 PM, Michał Górny  wrote:
> 
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
>>> 
>>> 
> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
>>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> Hi, everybody.
> It is my pleasure to announce that yesterday (EU) evening we've 
> switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
> This sounds like a filesystem issue. Do we know which filesystems are 
> suffering?
> ZFS should be fine. I believe ext2/ext3 have problems with this many 
> files. ext4 is probably okay, but don’t quote me on that.
>>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>>> 
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone 
>>> would use them for a mirror). NTFS and NFS are though.
>>>> However, just because modern filesystems can handle them efficiently, it
>>>> doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having 
>>> huge directories does not come with zero cost?
>>> 
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
>>> this, but the impact should be negligible. Filesystems with O(log n) 
>>> directory lookups would see faster directory lookups.
>>> 
>>> Outside of directory lookups, this could speed up searches and sort
>>> operations when listing everything with just about any filesystem 
>>> benefiting from the improvement.
>>> 
>>> Listing directories on such filesystems should not benefit from this unless 
>>> you are using ls where the default behavior is to sort the directory 
>>> contents (which is where the improvement when sorting comes into play). The 
>>> need to sort the directory contents by default keeps ls from displaying 
>>> anything until it has scanned the entire directory. The asymptotic 
>>> complexity of a fast comparison based sort improves in this situation from 
>>> O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
>>> independently. A further speed up could be obtained by doing multithreading 
>>> to parallelize the sort operations.
>> I read your original email late at night and I misread the description of 
>> how this works.
>> 
>> At an initial glance, I thought we were doing a prefix approach (with the 
>> caveat that buckets are unbalanced). In reality, we are doing a 
>> cryptographic hash of the filenames.
>> 
>> That would keep all buckets balanced, which gives the best directory lookup 
>> times on O(log n) lookup filesystems, but I think there is something to be 
>> gained from using the less optimal approach of using filename prefixes:
>> 
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>> 
>> I realize that I am late to propose it, but could we consider a switch to 
>> this alternative arrangement?
> 
> No, we can't.  Please read either the original discussion on the bug, or
> the linked article.  It's explained in detail why this won't work.
Alright. I am convinced. Thanks.
> 
> -- 
> Best regards,
> Michał Górny
> 




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
> > 
> > 
> > > > On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny  
> > > > > > > > > wrote:
> > > > > > > > Hi, everybody.
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening we've 
> > > > > > > > switched
> > > > > > > > to a new distfile mirror layout.  Users will be switching to 
> > > > > > > > the new
> > > > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they 
> > > > > > > > upgraded
> > > > > > > > already -- as their caches expire (24hrs).
> > > > > > > > The new layout is mostly a bow towards mirror admins, for some 
> > > > > > > > of whom
> > > > > > > > having a 6+ files in a single directory have been a problem.
> > > > > > > > However, I suppose some of you also found e.g. the directory 
> > > > > > > > index
> > > > > > > > hardly usable due to its size.
> > > > This sounds like a filesystem issue. Do we know which filesystems are 
> > > > suffering?
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this many 
> > > > files. ext4 is probably okay, but don’t quote me on that.
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > > may apply only to older ntfs versions.  NFS has been mentioned too.
> > 
> > ext2 and vfat are not surprises to me (outside of the idea that anyone 
> > would use them for a mirror). NTFS and NFS are though.
> > > However, just because modern filesystems can handle them efficiently, it
> > > doesn't mean having directories that huge comes with zero cost.
> > While I am okay with the change, what do you mean when you say that having 
> > huge directories does not come with zero cost?
> > 
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> > this, but the impact should be negligible. Filesystems with O(log n) 
> > directory lookups would see faster directory lookups.
> > 
> > Outside of directory lookups, this could speed up searches and sort
> > operations when listing everything with just about any filesystem 
> > benefiting from the improvement.
> > 
> > Listing directories on such filesystems should not benefit from this unless 
> > you are using ls where the default behavior is to sort the directory 
> > contents (which is where the improvement when sorting comes into play). The 
> > need to sort the directory contents by default keeps ls from displaying 
> > anything until it has scanned the entire directory. The asymptotic 
> > complexity of a fast comparison based sort improves in this situation from 
> > O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
> > independently. A further speed up could be obtained by doing multithreading 
> > to parallelize the sort operations.
> I read your original email late at night and I misread the description of how 
> this works.
> 
> At an initial glance, I thought we were doing a prefix approach (with the 
> caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
> hash of the filenames.
> 
> That would keep all buckets balanced, which gives the best directory lookup 
> times on O(log n) lookup filesystems, but I think there is something to be 
> gained from using the less optimal approach of using filename prefixes:
> 
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
> 
> I realize that I am late to propose it, but could we consider a switch to 
> this alternative arrangement?

No, we can't.  Please read either the original discussion on the bug, or
the linked article.  It's explained in detail why this won't work.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
> 
> 
>>> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
>>> Hi, everybody.
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>> This sounds like a filesystem issue. Do we know which filesystems are 
>>> suffering?
>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
>>> ext4 is probably okay, but don’t quote me on that.
>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>> may apply only to older ntfs versions.  NFS has been mentioned too.
> 
> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
> use them for a mirror). NTFS and NFS are though.
>> However, just because modern filesystems can handle them efficiently, it
>> doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having 
> huge directories does not come with zero cost?
> 
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> this, but the impact should be negligible. Filesystems with O(log n) 
> directory lookups would see faster directory lookups.
> 
> Outside of directory lookups, this could speed up searches and sort
> operations when listing everything with just about any filesystem benefiting 
> from the improvement.
> 
> Listing directories on such filesystems should not benefit from this unless 
> you are using ls where the default behavior is to sort the directory contents 
> (which is where the improvement when sorting comes into play). The need to 
> sort the directory contents by default keeps ls from displaying anything 
> until it has scanned the entire directory. The asymptotic complexity of a 
> fast comparison based sort improves in this situation from O(nlogn) to 
> O(nlog(n/b)) provided that you sort each subdirectory independently. A 
> further speed up could be obtained by doing multithreading to parallelize the 
> sort operations.
I read your original email late at night and I misread the description of how 
this works.

At an initial glance, I thought we were doing a prefix approach (with the 
caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
hash of the filenames.
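
The difference in bucket balance between the two schemes can be sketched as
follows.  The filenames are made up for illustration (many names sharing a
common prefix), not data taken from the mirrors, and the helper names
`hash_bucket` and `prefix_bucket` are hypothetical.

```python
import hashlib
from collections import Counter

# Made-up names: a large group sharing one prefix plus a small second group.
names = [f"python-pkg{i}-1.0.tar.gz" for i in range(1000)]
names += [f"zlib-1.2.{i}.tar.gz" for i in range(20)]


def hash_bucket(name):
    """Bucket by the first two hex digits of the BLAKE2b hash of the name."""
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


def prefix_bucket(name):
    """Bucket by the first character of the name."""
    return name[0]


by_hash = Counter(hash_bucket(n) for n in names)
by_prefix = Counter(prefix_bucket(n) for n in names)

print("hash buckets used:    ", len(by_hash))
print("largest hash bucket:  ", max(by_hash.values()))
print("prefix buckets used:  ", len(by_prefix))
print("largest prefix bucket:", max(by_prefix.values()))
```

With prefixes, every name in the large group lands in the same bucket; with
the hash, the names spread over many buckets.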

That would keep all buckets balanced, which gives the best directory lookup 
times on O(log n) lookup filesystems, but I think there is something to be 
gained from using the less optimal approach of using filename prefixes:

* some regex searches on distfiles can be accelerated
* generating a sorted list of all distfiles becomes asymptotically faster
* it is easy for a user to find all versions of a given distfile
* no need to calculate a cryptographic hash

I realize that I am late to propose it, but could we consider a switch to this 
alternative arrangement?

The bulk of the performance gain should be realized with either approach.
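
The sorted-list point for prefix buckets can be sketched as below.  With b
buckets of roughly n/b names each, sorting every bucket separately costs
about b * (n/b) * log(n/b) = n * log(n/b) comparisons, and because the
buckets are already ordered by prefix, concatenating them gives a fully
sorted list.  The function name is hypothetical, and the sketch buckets by
first character only.

```python
from collections import defaultdict


def sorted_via_prefix_buckets(names):
    """Sort names by sorting each first-character bucket independently."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name[:1]].append(name)
    result = []
    # Visit the buckets in prefix order; each bucket is sorted on its own,
    # so the concatenation is sorted overall.
    for prefix in sorted(buckets):
        result.extend(sorted(buckets[prefix]))
    return result


sample = ["zlib-1.2.11.tar.gz", "foo-1.tar.gz", "bar-2.tar.xz", "foo-0.9.tar.gz"]
print(sorted_via_prefix_buckets(sample) == sorted(sample))
```

This shortcut does not carry over to hash buckets, since the bucket name
there says nothing about how the filenames inside it compare with those in
other buckets.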

> Since I know someone will call me out on that comment, I will explain. Each 
> bucket has roughly n/b items in it where n is the total number and b is the 
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each 
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now 
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) 
> comparison sort on this very special case where you call it multiple times on 
> data that has been presorted by prefix into buckets.
> 
> Is there any other benefit to this or did I get everything?
> 
> By the way, it is offtopic for the thread, but it occurs to me that a hybrid 
> of radix sort and a comparison-based sort could give us a general sorting
> algorithm that is asymptotically faster than O(nlogn).
>> [1] https://bugs.gentoo.org/534528
>> --
>> Best regards,
>> Michał Górny




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Sat, 2019-10-19 at 15:31 +0200, Fabian Groffen wrote:
> Hi,
> 
> On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> > 
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> > 
> > 
> > Alternatively, you can:
> > 
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> > 
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> 
> Would it be possible to run a service that sends a 302 for the
> distfiles/foo-1.tar.gz to the appropriate bucket such that manual
> fetching doesn't require calculating the hash?
> 
> I prototyped this myself for distfiles.prefix, and it seems like a nice
> gesture for at least the transition period?
> 

That would only work for servers whose admins would explicitly install
the service, i.e. not for anyone using GENTOO_MIRRORS.  If you're
talking purely about distfiles.gentoo.org, we may add something like
that by the end of the transitional period.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Fabian Groffen
Hi,

On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
> 
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
> 
> 
> Alternatively, you can:
> 
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
> 
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.

Would it be possible to run a service that sends a 302 for the
distfiles/foo-1.tar.gz to the appropriate bucket such that manual
fetching doesn't require calculating the hash?

I prototyped this myself for distfiles.prefix, and it seems like a nice
gesture for at least the transition period?
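
For illustration only (this is not the distfiles.prefix prototype), the core
of such a redirect could look roughly like the sketch below.  It assumes the
bucket is the first two hex digits of the BLAKE2b hash of the filename, as in
the announcement; the function name `redirect_target` and the `/distfiles/`
prefix argument are hypothetical.

```python
import hashlib


def redirect_target(path, prefix="/distfiles/"):
    """Return the bucketed path for a flat request path, or None."""
    if not path.startswith(prefix):
        return None
    name = path[len(prefix):]
    # Only flat requests (no further slash) are redirected; paths that
    # already contain a bucket directory are left alone.
    if not name or "/" in name:
        return None
    bucket = hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}{bucket}/{name}"


target = redirect_target("/distfiles/foo-1.tar.gz")
if target is not None:
    print("302 Found; Location:", target)
```

A real service would additionally need to check that the file exists before
answering, which is left out here.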

Thanks,
Fabian


-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 19, 2019, at 2:17 AM, Michał Górny  wrote:
> 
> On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
>>> 
>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
>>> Hi, everybody.
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>> This sounds like a filesystem issue. Do we know which filesystems are
>>>> suffering?
>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many
>>>> files. ext4 is probably okay, but don’t quote me on that.
>>> 
>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>> 
>> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
>> use them for a mirror). NTFS and NFS are though.
> 
> Are you surprised that people use NTFS on Windows?  Or that they use
> local mirrors over NFS?  The latter still needs to be addressed
> separately, provided that they mount it on DISTDIR.
I am surprised that it was an issue on NTFS because it uses B-trees. As for 
NFS, I had expected that to be more dependent on the local filesystem than on 
NFS itself. If it has a slowdown when used on a filesystem that has fast 
directory operations, that might be a bug.
> 
>>> However, just because modern filesystems can handle them efficiently, it
>>> doesn't mean having directories that huge comes with zero cost.
>> While I am okay with the change, what do you mean when you say that having 
>> huge directories does not come with zero cost?
>> 
>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
>> this
> 
> O(1) or O(n)?
ZFS uses extendible hashing for its directories, so the data structure used is 
amortized O(1). You might consider it O(log n) due to the indirect tree 
traversal needed to find the direct block containing the hash table entry. With 
caching of indirect blocks, it should be amortized O(1) to find the direct 
block in practice as far as read IOs are considered. In addition, the base of 
the logarithm is 128 or 1024 depending on the pool feature flags.
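
As a purely arithmetic illustration of why a large logarithm base keeps the
tree shallow (this does not model ZFS internals), the sketch below counts how
many levels are needed to cover n entries for a given fan-out.  The value of
n is a hypothetical round number, not a figure from this thread, and the
function name is made up.

```python
def levels_needed(n, base):
    """Smallest d with base**d >= n, computed with integer arithmetic."""
    d, capacity = 0, 1
    while capacity < n:
        capacity *= base
        d += 1
    return d


n = 100_000  # hypothetical number of entries
for base in (2, 128, 1024):
    print(f"base {base}: {levels_needed(n, base)} level(s)")
```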
> 
>> , but the impact should be negligible. Filesystems with O(log n) directory 
>> lookups would see faster directory lookups.
>> 
>> Outside of directory lookups, this could speed up searches and sort
>> operations when listing everything with just about any filesystem benefiting 
>> from the improvement.
>> 
>> Listing directories on such filesystems should not benefit from this unless 
>> you are using ls where the default behavior is to sort the directory 
>> contents (which is where the improvement when sorting comes into play). The 
>> need to sort the directory contents by default keeps ls from displaying 
>> anything until it has scanned the entire directory. The asymptotic 
>> complexity of a fast comparison based sort improves in this situation from 
>> O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
>> independently. A further speed up could be obtained by doing multithreading 
>> to parallelize the sort operations.
>> 
>> Since I know someone will call me out on that comment, I will explain. Each 
>> bucket has roughly n/b items in it where n is the total number and b is the 
>> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort 
>> each of the b buckets. The buckets are pre-sorted by prefix, so the result 
>> is now sorted. You therefore get O(nlog(n/b)) time complexity out of an 
>> O(nlogn) comparison sort on this very special case where you call it 
>> multiple times on data that has been presorted by prefix into buckets.
>> 
>> Is there any other benefit to this or did I get everything?
> 
> Listings for individual directories won't cause major pain to browsers
> anymore.  Not that there's much reason to do them.
That makes sense.
> 
> All kinds of per-directory operations will consume less memory
> and be potentially faster.
Userland would save memory when sorting or grepping a directory listing by 
virtue of having to process less data for grep and less data at a time for 
sorting (if it takes advantage of this). That would have performance benefits 
in userland.

The kernel would have little memory savings and in some cases might be slightly 
worse. It is negligible. Performance in the kernel ought to be slightly better 
on filesystems with O(log n) directory operations, but I would only 

Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
> > 
> > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny  
> > > > > > > wrote:
> > > > > > Hi, everybody.
> > > > > > It is my pleasure to announce that yesterday (EU) evening we've 
> > > > > > switched
> > > > > > to a new distfile mirror layout.  Users will be switching to the new
> > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they 
> > > > > > upgraded
> > > > > > already -- as their caches expire (24hrs).
> > > > > > The new layout is mostly a bow towards mirror admins, for some of 
> > > > > > whom
> > > > > > having a 6+ files in a single directory have been a problem.
> > > > > > However, I suppose some of you also found e.g. the directory index
> > > > > > hardly usable due to its size.
> > > This sounds like a filesystem issue. Do we know which filesystems are 
> > > suffering?
> > > ZFS should be fine. I believe ext2/ext3 have problems with this many 
> > > files. ext4 is probably okay, but don’t quote me on that.
> > 
> > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > may apply only to older ntfs versions.  NFS has been mentioned too.
> 
> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
> use them for a mirror). NTFS and NFS are though.

Are you surprised that people use NTFS on Windows?  Or that they use
local mirrors over NFS?  The latter still needs to be addressed
separately, provided that they mount it on DISTDIR.

> > However, just because modern filesystems can handle them efficiently, it
> > doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having 
> huge directories does not come with zero cost?
> 
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> this

O(1) or O(n)?

> , but the impact should be negligible. Filesystems with O(log n) directory 
> lookups would see faster directory lookups.
> 
> Outside of directory lookups, this could speed up searches and sort 
> operations when listing everything, with just about any filesystem 
> benefiting from the improvement.
> 
> Listing directories on such filesystems should not benefit from this unless 
> you are using ls where the default behavior is to sort the directory contents 
> (which is where the improvement when sorting comes into play). The need to 
> sort the directory contents by default keeps ls from displaying anything 
> until it has scanned the entire directory. The asymptotic complexity of a 
> fast comparison based sort improves in this situation from O(nlogn) to 
> O(nlog(n/b)) provided that you sort each subdirectory independently. A 
> further speed up could be obtained by doing multithreading to parallelize the 
> sort operations.
> 
> Since I know someone will call me out on that comment, I will explain. Each 
> bucket has roughly n/b items in it where n is the total number and b is the 
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each 
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now 
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) 
> comparison sort on this very special case where you call it multiple times on 
> data that has been presorted by prefix into buckets.
> 
> Is there any other benefit to this or did I get everything?

Listings for individual directories won't cause major pain to browsers
anymore.  Not that there's much reason to do them.

All kinds of per-directory operations will consume less memory
and be potentially faster.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-18 Thread Richard Yao


> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
> 
> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> Hi, everybody.
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
>> This sounds like a filesystem issue. Do we know which filesystems are 
>> suffering?
>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
>> ext4 is probably okay, but don’t quote me on that.
> 
> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> may apply only to older ntfs versions.  NFS has been mentioned too.

ext2 and vfat are not surprises to me (outside of the idea that anyone would 
use them for a mirror). NTFS and NFS are though.
> 
> However, just because modern filesystems can handle them efficiently, it
> doesn't mean having directories that huge comes with zero cost.
While I am okay with the change, what do you mean when you say that having huge 
directories does not come with zero cost?

Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
this, but the impact should be negligible. Filesystems with O(log n) directory 
lookups would see faster directory lookups.

Outside of directory lookups, this could speed up searches and sort 
operations when listing everything, with just about any filesystem benefiting 
from the improvement.

Listing directories on such filesystems should not benefit from this unless you 
are using ls where the default behavior is to sort the directory contents 
(which is where the improvement when sorting comes into play). The need to sort 
the directory contents by default keeps ls from displaying anything until it 
has scanned the entire directory. The asymptotic complexity of a fast 
comparison-based sort improves in this situation from O(nlogn) to O(nlog(n/b)), 
provided that you sort each subdirectory independently. A further speedup 
could be obtained by using multithreading to parallelize the sort operations.

Since I know someone will call me out on that comment, I will explain. Each 
bucket has roughly n/b items in it, where n is the total number of files and b 
is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Looping over 
all b buckets therefore costs b * O(n/b * log(n/b)) = O(nlog(n/b)). The buckets 
are pre-sorted by prefix, so the concatenated result is sorted. You therefore 
get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort in this 
very special case where you call it multiple times on data that has been 
presorted by prefix into buckets.
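
To make the bucket argument concrete, here is a minimal Python sketch. It 
assumes buckets keyed by a fixed-length filename prefix, as in the explanation 
above. The function names and the sample figures (n = 65536, b = 256) are 
purely illustrative and are not taken from Portage or from the mirror setup.

```python
import math
from collections import defaultdict


def bucketed_sort(names, prefix_len=2):
    """Sort names by sorting each prefix bucket independently.

    Assumption: buckets are keyed by the first prefix_len characters,
    so visiting the buckets in key order and concatenating the sorted
    buckets yields a fully sorted list.
    """
    buckets = defaultdict(list)
    for name in names:
        buckets[name[:prefix_len]].append(name)

    result = []
    # Ordering the b bucket keys costs O(b log b), small when b << n.
    for prefix in sorted(buckets):
        # Each bucket holds roughly n/b names: O(n/b * log(n/b)) per bucket.
        result.extend(sorted(buckets[prefix]))
    return result


def comparison_estimates(n, b):
    """Return (n*log2(n), n*log2(n/b)) as rough comparison counts."""
    return n * math.log2(n), n * math.log2(n / b)


if __name__ == "__main__":
    sample = ["zlib-1.2.tar.gz", "bash-5.0.tar.gz", "foo-1.tar.gz", "bar-2.tar.gz"]
    print(bucketed_sort(sample))
    print(comparison_estimates(65536, 256))
```

Note that this follows the prefix assumption made in this mail. The new mirror 
layout buckets files by a hash of the filename, so concatenating its 
directories would not by itself produce a globally sorted listing.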

Is there any other benefit to this or did I get everything?

By the way, it is off-topic for the thread, but it occurs to me that a hybrid of 
radix sort and a comparison-based sort could give us a general sorting 
algorithm that is asymptotically faster than O(nlogn).
> 
> [1] https://bugs.gentoo.org/534528
> 
> -- 
> Best regards,
> Michał Górny




Re: [gentoo-dev] New distfile mirror layout

2019-10-18 Thread Michał Górny
On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> > 
> > Hi, everybody.
> > 
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> > 
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having a 6+ files in a single directory have been a problem. 
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> This sounds like a filesystem issue. Do we know which filesystems are 
> suffering?
> 
> ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
> ext4 is probably okay, but don’t quote me on that.

Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
may apply only to older ntfs versions.  NFS has been mentioned too.

However, just because modern filesystems can handle them efficiently, it
doesn't mean having directories that huge comes with zero cost.

[1] https://bugs.gentoo.org/534528

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-18 Thread Richard Yao



> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> 
> Hi, everybody.
> 
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> 
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem. 
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
This sounds like a filesystem issue. Do we know which filesystems are suffering?

ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
ext4 is probably okay, but don’t quote me on that.
> 
> Throughout a transitional period (whose exact length hasn't been decided
> yet), both layouts will be available.  Afterwards, the old layout will
> be removed from mirrors.  This has a few implications:
> 
> 1. Users who don't upgrade their package managers in time will lose
> the ability of fetching from Gentoo mirrors.  This shouldn't be that
> much of a problem given that the core software needed to upgrade Portage
> should all have reliable upstream SRC_URIs.
> 
> 2. mirror://gentoo/file URIs will stop working.  While technically you
> could use mirror://gentoo/XX/file, I'd rather recommend finally
> discarding its usage and moving distfiles to devspace.
> 
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
> 
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
> 
> 
> Alternatively, you can:
> 
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
> 
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
> 
> 
> If you're interested in more background details and some plots, see [1].
> 
> [1] 
> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> 
> -- 
> Best regards,
> Michał Górny
> 




[gentoo-dev] New distfile mirror layout

2019-10-18 Thread Michał Górny
Hi, everybody.

It is my pleasure to announce that yesterday (EU) evening we've switched
to a new distfile mirror layout.  Users will be switching to the new
layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
already -- as their caches expire (24hrs).

The new layout is mostly a bow towards mirror admins, for some of whom
having 6+ files in a single directory has been a problem.
However, I suppose some of you also found e.g. the directory index
hardly usable due to its size.

Throughout a transitional period (whose exact length hasn't been decided
yet), both layouts will be available.  Afterwards, the old layout will
be removed from mirrors.  This has a few implications:

1. Users who don't upgrade their package managers in time will lose
the ability to fetch from Gentoo mirrors.  This shouldn't be that
much of a problem given that the core software needed to upgrade Portage
should all have reliable upstream SRC_URIs.

2. mirror://gentoo/file URIs will stop working.  While technically you
could use mirror://gentoo/XX/file, I'd rather recommend finally
discarding its usage and moving distfiles to devspace.

3. Directly fetching files from distfiles.gentoo.org will become
a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
to use something like:

$ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
1b
$ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
...


Alternatively, you can:

$ wget http://distfiles.gentoo.org/distfiles/INDEX

and grep for the right path there.  This INDEX is also a more
lightweight alternative to HTML indexes generated by the servers.
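
For scripting, the two-character directory from item 3 can also be computed 
without the shell.  The following is only a sketch, assuming Python's 
hashlib.blake2b with its default 64-byte digest, which corresponds to what 
b2sum prints by default.  The helper names distfile_bucket and distfile_url 
are made up for this example and are not part of Portage.

```python
import hashlib

# Base URL as used in the wget example above.
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def distfile_bucket(filename: str) -> str:
    """Return the two-hex-digit directory for a distfile name.

    Equivalent in intent to:
        printf '%s' NAME | b2sum | cut -c1-2
    The name is hashed without a trailing newline. UTF-8 encoding is
    an assumption that only matters for non-ASCII file names.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:2]


def distfile_url(filename: str, base: str = BASE_URL) -> str:
    """Build the full URL of a distfile under the new layout."""
    return f"{base}/{distfile_bucket(filename)}/{filename}"


if __name__ == "__main__":
    print(distfile_url("foo-1.tar.gz"))
```

The hash is taken over the file name, not over the file contents, so the 
directory can be determined before anything has been downloaded.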


If you're interested in more background details and some plots, see [1].

[1] 
https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html

-- 
Best regards,
Michał Górny


