Re: CPAN-river: can graph calculation be modified?
On 3 February 2018 at 11:28, H.Merijn Brand wrote:

> Breaking something up-river of say DBI will affect just 3 authors (the
> (co)maints), whereas it affects millions of people (the users).
>
> If some brave author maintains two or more up-river modules, it is
> still just one author, but uncountable users. (don't count core modules
> here, that would make it too hard).

This.

While a "don't allow people to game the river" mentality might be useful
for a *popularity* metric (or an indirect sense of the CPAN authors' web
of trust), it's not a safe metric for deciding "what is worth testing".

The darkpan plays a serious role here. There is very little "real"
software on CPAN, only libraries. All the actual applications of the
CPAN libraries operate outside the realm of CPAN, and there is no way
to tell how many hidden users a given CPAN module has.

All software on CPAN is consequently "relevant" for testing, and the
only way you should use this graph is to *prioritize* which modules
you'll test first. You should still be encouraged to test all modules,
because they can all become broken due to domino effects, and there is
still a high chance of some real-world user depending on a "less
popular" module.

Or would you argue that something like App::DuckPAN is "Ok to break
because it doesn't have any reverse dependencies"?

And it's quite easy to find other unarguably high-use things on CPAN
which, due to how they work, are *unlikely* to have reverse
dependencies.

Take, for instance, cpanm-reporter. It would be quite easy to imagine a
reality where the 2 reverse dependencies it currently has never came to
exist. But it's clearly not the sort of thing you want to wave your hand
at as being unworthy of testing. (It's quite obvious that far more CPAN
authors actually use it than it has reverse dependencies.)

The river is consequently not any kind of *authority* on what is
actually being used. It's just a convenient-yet-inferior approximation.
It's better than nothing, but please don't let yourself interpret it as
being more than it is.

KENTNL - https://metacpan.org/author/KENTNL
Re: CPAN-river: can graph calculation be modified?
On Fri, 2 Feb 2018 12:44:43 -0500, David Golden wrote:

> It's possible that an *alternate* simplest thing might be more meaningful:
> count the number of distinct *authors* depending on any distribution
> (including, for the sake of example, the same author, but only once).
>
> In the Foo case:
>
> - Foo has 3 authors depending on it
> - Foo-Bar has 3 authors depending on it
> - Foo-Bar-Noggin and Foo-Bar-Baz have 0 authors depending on it
> - Foo-Bar-A has 1 author depending on it
>
> In the Neil's Thing case:
>
> - Thing has 2
> - Plant has 1
> - Fruit and Banana each have 1
> - Silver-Banana has 0
>
> In Tux's Thing case, all the counts just increase by one and Distasteful
> has 0.
>
> Consider this case:
> Zot (Larry) -> Pow (Moe) -> Splat (Curly) -> Whiff (Moe) -> Oof (Larry)
>
> - Zot has 3
> - Pow has 3
> - Splat has 2
> - Whiff has 1
> - Oof has 0
>
> The interesting thing about this metric to me is that it focuses on this
> question: "If a module breaks, how many *people* are affected" which sounds
> a lot more like what Jim's asking.

No, it tells you how many *authors* are affected (or author groups).

Breaking something up-river of say DBI will affect just 3 authors (the
(co)maints), whereas it affects millions of people (the users).

If some brave author maintains two or more up-river modules, it is
still just one author, but uncountable users. (don't count core modules
here, that would make it too hard).

Say we have

  Broum + Brumble - Droki - Blimco - Turf
  ALEX  | BEN       JOKI    FLON     DIY
        |
        + Fruig - DBI - DBD::XY
          BEN     HIW   JOCKX

IMHO BEN should be counted twice for Broum, not once

my € 0.02

> Counting an author as 1 for any downstream by the same author is arbitrary
> -- I think it simplifies the analysis and gives more or less the same
> answer, but it could be done the other way, too, if people preferred.
>
> David
>
> On Fri, Feb 2, 2018 at 9:48 AM, James E Keenan wrote:
>
> > Overall Question: How can we implement different ways of constructing the
> > CPAN river?
-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/
Re: CPAN-river: can graph calculation be modified?
It's possible that an *alternate* simplest thing might be more meaningful:
count the number of distinct *authors* depending on any distribution
(including, for the sake of example, the same author, but only once).

In the Foo case:

- Foo has 3 authors depending on it
- Foo-Bar has 3 authors depending on it
- Foo-Bar-Noggin and Foo-Bar-Baz have 0 authors depending on it
- Foo-Bar-A has 1 author depending on it

In the Neil's Thing case:

- Thing has 2
- Plant has 1
- Fruit and Banana each have 1
- Silver-Banana has 0

In Tux's Thing case, all the counts just increase by one and Distasteful
has 0.

Consider this case:
Zot (Larry) -> Pow (Moe) -> Splat (Curly) -> Whiff (Moe) -> Oof (Larry)

- Zot has 3
- Pow has 3
- Splat has 2
- Whiff has 1
- Oof has 0

The interesting thing about this metric to me is that it focuses on this
question: "If a module breaks, how many *people* are affected" which sounds
a lot more like what Jim's asking.

Counting an author as 1 for any downstream by the same author is arbitrary
-- I think it simplifies the analysis and gives more or less the same
answer, but it could be done the other way, too, if people preferred.

David

On Fri, Feb 2, 2018 at 9:48 AM, James E Keenan wrote:

> Overall Question: How can we implement different ways of constructing the
> CPAN river?
>
> Background:
>
> Since about this time last year I've had occasion to use the concept of
> CPAN-river to derive lists of distributions to be tested against whatever
> Perl 5 blead is of the moment. In particular, for the last three months
> I've been creating assessments of the impact of monthly Perl 5 development
> releases on the "top 1000" of the CPAN river. (See, e.g.,
> http://thenceforward.net/perl/misc/cpan-river-1000-perl-5.27-master.psv.gz
> )
>
> To calculate the CPAN river, I've been using the programs developed by
> David Golden found here:
>
> https://github.com/dagolden/zzz-index-cpan-meta
>
> ... with one modification: a local branch for the second of the three
> programs cited there. I use a local branch because I'm using Linux and
> cannot install Ramdisk.
>
> Problem:
>
> As I've stared at this data over the past year I've become aware that the
> order in which distros appear in the river is not necessarily the most
> useful for assessing the real-world impact of changes in blead. Put less
> charitably, the CPAN river can be "gamed." It is possible for a person to
> release a large number of distributions which have dependencies on other
> distributions by the same author. That can boost some of those
> distributions high up into the CPAN river -- into, say, the "top 1000" that
> I use in my monthly program.
>
> But if that author's distributions are not depended upon by *other*
> authors' distributions then they are arguably less important than those
> such as Module-Build and DateTime which are depended upon by vast numbers
> of distros written by people other than those distros' maintainers.
>
> Since "testing against blead" programs take hours to run, I would like to
> have that time spent focusing on what I consider to be more relevant
> distros.
>
> For the 5.29.* development cycle starting in May of this year, I would
> like to be able to use a ranking of CPAN distros which goes beyond asking:
>
> * "How many other distributions depend on this one?"
>
> ... to asking:
>
> * "How many distributions by other authors/maintainers depend on this one?"
>
> Would that be feasible? Has anyone attempted this already?
>
> Thank you very much.
> Jim Keenan

-- 
David Golden
Twitter/IRC/GitHub: @xdg
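The distinct-author count described in this message can be sketched in a few lines. This is only an illustration in Python, not the code in zzz-index-cpan-meta: the function name and the two input dictionaries are hypothetical, and the arrows in the Zot example are read as "is depended on by", which is the reading that reproduces the counts given above.

```python
from collections import defaultdict

def downstream_author_counts(depends_on, author_of):
    """For each dist, count the distinct authors of all dists downstream of it.

    depends_on: dist -> list of dists it depends on (hypothetical input shape)
    author_of:  dist -> author ID
    """
    # Invert "X depends on Y" into "Y is depended on by X".
    dependents = defaultdict(set)
    for dist, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(dist)

    counts = {}
    for dist in author_of:
        # Walk every direct and indirect dependent of this dist.
        seen, stack = set(), [dist]
        while stack:
            for child in dependents[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        # Each author counts once, even the dist's own author.
        counts[dist] = len({author_of[d] for d in seen})
    return counts

# The chain from the message: Zot -> Pow -> Splat -> Whiff -> Oof
author_of = {"Zot": "Larry", "Pow": "Moe", "Splat": "Curly",
             "Whiff": "Moe", "Oof": "Larry"}
depends_on = {"Pow": ["Zot"], "Splat": ["Pow"],
              "Whiff": ["Splat"], "Oof": ["Whiff"]}

counts = downstream_author_counts(depends_on, author_of)
print(counts)
```

With these inputs the sketch gives Zot 3, Pow 3, Splat 2, Whiff 1 and Oof 0, matching the list in the message. Counting the dist's own author (Larry for Zot, via Oof) follows the "same author, but only once" wording.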
Re: CPAN-river: can graph calculation be modified? Neil Bowers
On 02/02/2018 11:08 AM, H.Merijn Brand wrote:

> On Fri, 2 Feb 2018 15:51:32 +, Neil Bowers wrote:
>
> > > For the 5.29.* development cycle starting in May of this year, I
> > > would like to be able to use a ranking of CPAN distros which goes
> > > beyond asking:
> > >
> > > * "How many other distributions depend on this one?"
> > >
> > > ... to asking:
> > >
> > > * "How many distributions by other authors/maintainers depend on
> > >   this one?"
> > >
> > > Would that be feasible? Has anyone attempted this already?
> >
> > When we were discussing the River model at QAH, and in discussions
> > afterwards, this came up. In the end we decided to keep things simple
> > and go with the current common definition. There are some tools in
> > the CPAN ecosystem that only count dependencies written by others.
> >
> > We’d need to agree which dists get ignored in this alternate scheme.
> > Consider this example:
> >
> > Here MARY has released a bunch of dists, but Foo-Bar is also relied
> > on by other dists written by MUNGO and MIDGE.
> >
> > The river count for Foo-Bar would be 2 here (ignoring the whole
> > branch that contains only dists from MARY), but the Foo river count
> > should be 3, I think. Foo-Bar “counts”, because it in turn is
> > depended on by dists from other authors. Otherwise the river count
> > would be 2 for both Foo and Foo-Bar. Basically we’re starting at the
> > “bottom” of the dependency graph, and trimming sub-graphs all from
> > one author.
> >
> > Also consider this example:
> >
> > What’s the river count of Plant: 0, 1, or 3? I think it should be 1,
> > in this alternate measure.
>
> 1 or 3: 1 if module chains from the same author are "compressed" to 1,
> 3 if not
>
> More interesting would be
>
>   Thing - Plant - Fruit - Banana - Silver Banana - Distasteful stuff
>   JOHN    PAUL    RINGO   RINGO    RINGO           GEORGE
>
> would plant now be 1, 2, or 4?
>
> > I.e. for sub-graphs by the same author, you only include the dist at
> > the head of the sub-graph.
>
> I'd suggest to have an option to squeeze any unbranched chain of
> modules from the same author to 1

I *think* that's what I'm aiming for.

Let's say I have a CPAN distro called Gamma on which nothing else
depends. I refactor code out of Gamma into Beta, such that Gamma now
depends on Beta. By the standard definition, Beta moves up-river, Gamma
down-river. Next I refactor code out of Beta into Alpha. Alpha is now
farther up-river than both Beta and Gamma. Suppose that Alpha now falls
into the "top 1000" of the CPAN river.

I then switch Perl community roles and start to play the role of "rapid
BBC evaluator." A certain portion of my BBC program is now taken up with
testing Alpha. But, assuming I confine my focus to the top 1000, that
means some *other* CPAN distribution -- perhaps one whose revdeps are
from different authors -- has been pushed out of the top 1000. That
means the data I generate for P5P has been skewed toward myself.

That's what I'd like to avert.

> > It would be useful to have both measures available: raw-river and
> > author-river.
> >
> > When looking at a dist there are (at least) three figures that might
> > be of interest: the full river count (total number of direct and
> > indirect dependencies), the author-filtered river count (as above),
> > and the number of direct dependencies (which could be split in 2 as
> > well).
> >
> > Neil

Thank you very much.
Jim Keenan
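The Alpha/Beta/Gamma scenario can be made concrete with a small sketch. It is written in Python purely for illustration. The function names, the author IDs and the contrasting Other-Dist/App-One pair are invented here, and "author-filtered" is taken in the simplest sense of the original question (dependents released by someone else), not the sub-graph trimming Neil describes.

```python
def transitive_dependents(dist, dependents):
    """Return every dist that directly or indirectly depends on `dist`."""
    seen, stack = set(), [dist]
    while stack:
        for child in dependents.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def river_counts(dist, dependents, author_of):
    """Return (raw count, count of dependents released by other authors)."""
    down = transitive_dependents(dist, dependents)
    by_others = sum(1 for d in down if author_of[d] != author_of[dist])
    return len(down), by_others

# Gamma depends on Beta, Beta depends on Alpha; one author owns all three.
# Other-Dist / App-One are invented to give a contrasting case.
author_of = {"Alpha": "ME", "Beta": "ME", "Gamma": "ME",
             "Other-Dist": "ANN", "App-One": "BOB"}
dependents = {"Alpha": ["Beta"], "Beta": ["Gamma"], "Other-Dist": ["App-One"]}

alpha = river_counts("Alpha", dependents, author_of)
other = river_counts("Other-Dist", dependents, author_of)
print("Alpha:", alpha, "Other-Dist:", other)
```

By the raw count Alpha (2) outranks Other-Dist (1). Counting only dependents by other authors reverses the order (0 versus 1), which is the skew the message wants to avoid.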
Re: CPAN-river: can graph calculation be modified?
On 02/02/2018 10:51 AM, Neil Bowers wrote:

> > For the 5.29.* development cycle starting in May of this year, I
> > would like to be able to use a ranking of CPAN distros which goes
> > beyond asking:
> >
> > * "How many other distributions depend on this one?"
> >
> > ... to asking:
> >
> > * "How many distributions by other authors/maintainers depend on
> >   this one?"
> >
> > Would that be feasible? Has anyone attempted this already?
>
> When we were discussing the River model at QAH, and in discussions
> afterwards, this came up. In the end we decided to keep things simple
> and go with the current common definition. There are some tools in the
> CPAN ecosystem that only count dependencies written by others.

Can you point us toward those tools?

> We’d need to agree which dists get ignored in this alternate scheme.

Please note that I'm not looking to replace the current definition. I'm
looking to develop supplementary definition(s) -- and their
implementations -- that can be useful in particular circumstances.

> Consider this example:
>
> Here MARY has released a bunch of dists, but Foo-Bar is also relied on
> by other dists written by MUNGO and MIDGE.
>
> The river count for Foo-Bar would be 2 here (ignoring the whole branch
> that contains only dists from MARY), but the Foo river count should be
> 3, I think. Foo-Bar “counts”, because it in turn is depended on by
> dists from other authors. Otherwise the river count would be 2 for
> both Foo and Foo-Bar. Basically we’re starting at the “bottom” of the
> dependency graph, and trimming sub-graphs all from one author.
>
> Also consider this example:
>
> What’s the river count of Plant: 0, 1, or 3? I think it should be 1,
> in this alternate measure. I.e. for sub-graphs by the same author, you
> only include the dist at the head of the sub-graph.
>
> It would be useful to have both measures available: raw-river and
> author-river.
>
> When looking at a dist there are (at least) three figures that might
> be of interest: the full river count (total number of direct and
> indirect dependencies), the author-filtered river count (as above),
> and the number of direct dependencies (which could be split in 2 as
> well).
>
> Neil

Thank you very much.
Jim Keenan
Re: CPAN-river: can graph calculation be modified?
On Fri, 2 Feb 2018 15:51:32 +, Neil Bowers wrote:

> > For the 5.29.* development cycle starting in May of this year, I would like
> > to be able to use a ranking of CPAN distros which goes beyond asking:
> >
> > * "How many other distributions depend on this one?"
> >
> > ... to asking:
> >
> > * "How many distributions by other authors/maintainers depend on this one?"
> >
> > Would that be feasible? Has anyone attempted this already?
>
> When we were discussing the River model at QAH, and in discussions
> afterwards, this came up. In the end we decided to keep things simple and go
> with the current common definition. There are some tools in the CPAN
> ecosystem that only count dependencies written by others.
>
> We’d need to agree which dists get ignored in this alternate scheme. Consider
> this example:
>
> Here MARY has released a bunch of dists, but Foo-Bar is also relied
> on by other dists written by MUNGO and MIDGE.
>
> The river count for Foo-Bar would be 2 here (ignoring the whole
> branch that contains only dists from MARY), but the Foo river count
> should be 3, I think. Foo-Bar “counts”, because it in turn is
> depended on by dists from other authors. Otherwise the river count
> would be 2 for both Foo and Foo-Bar. Basically we’re starting at the
> “bottom” of the dependency graph, and trimming sub-graphs all from
> one author.
>
> Also consider this example:
>
> What’s the river count of Plant: 0, 1, or 3? I think it should be 1,
> in this alternate measure.

1 or 3: 1 if module chains from the same author are "compressed" to 1,
3 if not

More interesting would be

  Thing - Plant - Fruit - Banana - Silver Banana - Distasteful stuff
  JOHN    PAUL    RINGO   RINGO    RINGO           GEORGE

would plant now be 1, 2, or 4?

> I.e. for sub-graphs by the same author, you only include the dist at
> the head of the sub-graph.

I'd suggest to have an option to squeeze any unbranched chain of
modules from the same author to 1

> It would be useful to have both measures available: raw-river and
> author-river.
>
> When looking at a dist there are (at least) three figures that might
> be of interest: the full river count (total number of direct and
> indirect dependencies), the author-filtered river count (as above),
> and the number of direct dependencies (which could be split in 2 as
> well).
>
> Neil

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/
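The "squeeze" option suggested above can be sketched for the simple case of a single unbranched chain. The sketch is in Python for illustration only and rests on two assumptions: the function name is made up, and the author row of the diagram is read as JOHN, PAUL, RINGO, RINGO, RINGO, GEORGE. A real implementation would also have to handle branching graphs, which this does not attempt.

```python
from itertools import groupby

def chain_counts(downstream_authors):
    """Return (unsqueezed, squeezed) counts for a linear chain.

    downstream_authors lists the author of each dist downstream of the
    one being scored, in chain order.
    """
    unsqueezed = len(downstream_authors)
    # groupby merges each run of consecutive dists by the same author
    # into one group, i.e. a squeezed chain segment counts as 1.
    squeezed = sum(1 for _ in groupby(downstream_authors))
    return unsqueezed, squeezed

# Downstream of Plant in Neil's example: Fruit, Banana, Silver Banana
neil_case = ["RINGO", "RINGO", "RINGO"]
# The extended example adds "Distasteful stuff" at the end of the chain
extended_case = neil_case + ["GEORGE"]

print(chain_counts(neil_case))
print(chain_counts(extended_case))
```

For Plant this yields 3 unsqueezed and 1 squeezed in Neil's example, and 4 unsqueezed and 2 squeezed once the GEORGE dist is appended, which lines up with the "1 or 3" and "2, or 4" figures in the message.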
Re: CPAN-river: can graph calculation be modified?
> For the 5.29.* development cycle starting in May of this year, I would like
> to be able to use a ranking of CPAN distros which goes beyond asking:
>
> * "How many other distributions depend on this one?"
>
> ... to asking:
>
> * "How many distributions by other authors/maintainers depend on this one?"
>
> Would that be feasible? Has anyone attempted this already?

When we were discussing the River model at QAH, and in discussions
afterwards, this came up. In the end we decided to keep things simple
and go with the current common definition. There are some tools in the
CPAN ecosystem that only count dependencies written by others.

We’d need to agree which dists get ignored in this alternate scheme.
Consider this example:

Here MARY has released a bunch of dists, but Foo-Bar is also relied on
by other dists written by MUNGO and MIDGE.

The river count for Foo-Bar would be 2 here (ignoring the whole branch
that contains only dists from MARY), but the Foo river count should be
3, I think. Foo-Bar “counts”, because it in turn is depended on by dists
from other authors. Otherwise the river count would be 2 for both Foo
and Foo-Bar. Basically we’re starting at the “bottom” of the dependency
graph, and trimming sub-graphs all from one author.

Also consider this example:

What’s the river count of Plant: 0, 1, or 3? I think it should be 1, in
this alternate measure. I.e. for sub-graphs by the same author, you only
include the dist at the head of the sub-graph.

It would be useful to have both measures available: raw-river and
author-river.

When looking at a dist there are (at least) three figures that might be
of interest: the full river count (total number of direct and indirect
dependencies), the author-filtered river count (as above), and the
number of direct dependencies (which could be split in 2 as well).

Neil

