Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread Kent Fredric
On 3 February 2018 at 11:28, H.Merijn Brand  wrote:
> Breaking something up-river of say DBI will affect just 3 authors (the
> (co)maints), whereas it affects millions of people (the users).
>
> If some brave author maintains two or more up-river modules, it is
> still just one author, but uncountable users. (don't count core modules
> here, that would make it too hard).
>

This.

While a "don't allow people to game the river" mentality might be
useful for a *popularity* metric (or an indirect sense of the CPAN
authors' web of trust), it's not a safe metric for deciding "what is
worth testing".

The darkpan plays a serious role here.

There is very little "real" software on CPAN, only libraries. All the
actual applications of the CPAN libraries operate outside of the
realm of CPAN.

And there is no way to tell how many hidden users exist of a given CPAN module.

All software on CPAN is consequently "relevant" for testing, and the
only way you should use this graph is to *prioritize* which modules
you'll test first.

Though you should still be encouraged to test all modules, because
they can all become broken due to domino effects, and there is still
the high chance of there being some real world user who is using a
"less popular" module.

Or would you argue that something like App::DuckPAN is "Ok to break
because it doesn't have any reverse dependencies"?

And it's quite easy to find other unarguably high-use things on CPAN
which, due to how they work, are *unlikely* to have reverse
dependencies.

Take, for instance, cpanm-reporter.

It would be quite easy to imagine a reality where the 2 reverse
dependencies it currently has never came to exist. But it's clearly not
the sort of thing you want to wave your hand at as being unworthy of
testing. (It's quite obvious that far more people, CPAN authors among
them, actually use it than it has reverse dependencies.)

The river is consequently not any kind of *authority* on what is
actually being used. It's just a convenient-yet-inferior approximation.
It's better than nothing, but please don't let yourself interpret it as
being more than it is.


KENTNL - https://metacpan.org/author/KENTNL


Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread H.Merijn Brand
On Fri, 2 Feb 2018 12:44:43 -0500, David Golden  wrote:

> It's possible that an *alternate* simplest thing might be more meaningful:
> count the number of distinct *authors* depending on any distribution
> (including, for the sake of example, the same author, but only once).
> 
> In the Foo case:
> 
>    - Foo has 3 authors depending on it
>    - Foo-Bar has 3 authors depending on it
>    - Foo-Bar-Noggin and Foo-Bar-Baz have 0 authors depending on them
>    - Foo-Bar-A has 1 author depending on it
> 
> In the Neil's Thing case:
> 
>    - Thing has 2
>    - Plant has 1
>    - Fruit and Banana each have 1
>    - Silver-Banana has 0
> 
> In Tux's Thing case, all the counts just increase by one and Distasteful
> has 0.
> 
> Consider this case:
> Zot (Larry) -> Pow (Moe) -> Splat (Curly) -> Whiff (Moe) -> Oof (Larry)
> 
>    - Zot has 3
>    - Pow has 3
>    - Splat has 2
>    - Whiff has 1
>    - Oof has 0
> 
> The interesting thing about this metric to me is that it focuses on this
> question: "If a module breaks, how many *people* are affected" which sounds
> a lot more like what Jim's asking.

No, it tells you how many *authors* are affected (or author groups).

Breaking something up-river of say DBI will affect just 3 authors (the
(co)maints), whereas it affects millions of people (the users).

If some brave author maintains two or more up-river modules, it is
still just one author, but uncountable users. (don't count core modules
here, that would make it too hard).

Say we have


  Broum + Brumble - Droki - Blimco - Turf
  ALEX  | BEN       JOKI    FLON     DIY
        |
        + Fruig   - DBI   - DBD::XY
          BEN       HIW     JOCKX

IMHO BEN should be counted twice for Broum, not once

my € 0.02

> Counting an author as 1 for any downstream by the same author is arbitrary
> -- I think it simplifies the analysis and gives more or less the same
> answer, but it could be done the other way, too, if people preferred.
> 
> David
> 
> On Fri, Feb 2, 2018 at 9:48 AM, James E Keenan  wrote:
> 
> > Overall Question:  How can we implement different ways of constructing the
> > CPAN river?
> >
> > Background:
> >
> > Since about this time last year I've had occasion to use the concept of
> > CPAN-river to derive lists of distributions to be tested against whatever
> > Perl 5 blead is of the moment.  In particular, for the last three months
> > I've been creating assessments of the impact of monthly Perl 5 development
> > releases on the "top 1000" of the CPAN river.  (See, e.g.,
> > http://thenceforward.net/perl/misc/cpan-river-1000-perl-5.27-master.psv.gz
> > )
> >
> > To calculate the CPAN river, I've been using the programs developed by
> > David Golden found here:
> >
> > https://github.com/dagolden/zzz-index-cpan-meta
> >
> > ... with one modification:  a local branch for the second of the three
> > programs cited there.  I use a local branch because I'm using Linux and
> > cannot install Ramdisk.
> >
> > Problem:
> >
> > As I've stared at this data over the past year I've become aware that the
> > order in which distros appear in the river is not necessarily the most
> > useful for assessing the real-world impact of changes in blead. Put less
> > charitably, the CPAN river can be "gamed."  It is possible for a person to
> > release a large number of distributions which have dependencies on other
> > distributions by the same author.  That can boost some of those
> > distributions high up into the CPAN river -- into, say, the "top 1000" that
> > I use in my monthly program.
> >
> > But if that author's distributions are not depended upon by *other*
> > authors' distributions then they are arguably less important than those
> > such as Module-Build and DateTime which are depended upon by vast numbers
> > of distros written by people other than those distros' maintainers.
> >
> > Since "testing against blead" programs take hours to run, I would like to
> > have that time spent focusing on what I consider to be more relevant
> > distros.
> >
> > For the 5.29.* development cycle starting in May of this year, I would
> > like to be able to use a ranking of CPAN distros which goes beyond asking:
> >
> > * "How many other distributions depend on this one?"
> >
> > ... to asking:
> >
> > * "How many distributions by other authors/maintainers depend on this one?"
> >
> > Would that be feasible?  Has anyone attempted this already?
> >
> > Thank you very much.
> > Jim Keenan
> >  


-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/




Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread David Golden
It's possible that an *alternate* simplest thing might be more meaningful:
count the number of distinct *authors* depending on any distribution
(including, for the sake of example, the same author, but only once).

In the Foo case:

   - Foo has 3 authors depending on it
   - Foo-Bar has 3 authors depending on it
   - Foo-Bar-Noggin and Foo-Bar-Baz have 0 authors depending on them
   - Foo-Bar-A has 1 author depending on it

In the Neil's Thing case:

   - Thing has 2
   - Plant has 1
   - Fruit and Banana each have 1
   - Silver-Banana has 0

In Tux's Thing case, all the counts just increase by one and Distasteful
has 0.

Consider this case:
Zot (Larry) -> Pow (Moe) -> Splat (Curly) -> Whiff (Moe) -> Oof (Larry)


   - Zot has 3
   - Pow has 3
   - Splat has 2
   - Whiff has 1
   - Oof has 0

The interesting thing about this metric to me is that it focuses on this
question: "If a module breaks, how many *people* are affected" which sounds
a lot more like what Jim's asking.

Counting an author as 1 for any downstream by the same author is arbitrary
-- I think it simplifies the analysis and gives more or less the same
answer, but it could be done the other way, too, if people preferred.
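
The distinct-author count described above can be sketched in a few lines of Python. This is only an illustration, not code from zzz-index-cpan-meta: the function names and the dictionary layout are assumptions, and the data is the Zot/Pow/Splat/Whiff/Oof chain from this message, with each arrow read as "is depended on by".

```python
# Hypothetical data layout: each dist maps to the dists that depend on it directly.
AUTHOR = {"Zot": "Larry", "Pow": "Moe", "Splat": "Curly", "Whiff": "Moe", "Oof": "Larry"}
DEPENDENTS = {"Zot": ["Pow"], "Pow": ["Splat"], "Splat": ["Whiff"], "Whiff": ["Oof"], "Oof": []}


def downstream(dist, dependents):
    """Return every dist that depends on `dist`, directly or indirectly."""
    seen = set()
    stack = list(dependents.get(dist, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependents.get(current, []))
    return seen


def author_count(dist, dependents, author):
    """Count distinct authors among downstream dists (the dist's own author counts once)."""
    return len({author[d] for d in downstream(dist, dependents)})


counts = {d: author_count(d, DEPENDENTS, AUTHOR) for d in AUTHOR}
print(counts)  # {'Zot': 3, 'Pow': 3, 'Splat': 2, 'Whiff': 1, 'Oof': 0}
```

Excluding a dist's own author instead (the "other way" mentioned above) would only need one extra filter inside `author_count`.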

David


On Fri, Feb 2, 2018 at 9:48 AM, James E Keenan  wrote:

> Overall Question:  How can we implement different ways of constructing the
> CPAN river?
>
> Background:
>
> Since about this time last year I've had occasion to use the concept of
> CPAN-river to derive lists of distributions to be tested against whatever
> Perl 5 blead is of the moment.  In particular, for the last three months
> I've been creating assessments of the impact of monthly Perl 5 development
> releases on the "top 1000" of the CPAN river.  (See, e.g.,
> http://thenceforward.net/perl/misc/cpan-river-1000-perl-5.27-master.psv.gz
> )
>
> To calculate the CPAN river, I've been using the programs developed by
> David Golden found here:
>
> https://github.com/dagolden/zzz-index-cpan-meta
>
> ... with one modification:  a local branch for the second of the three
> programs cited there.  I use a local branch because I'm using Linux and
> cannot install Ramdisk.
>
> Problem:
>
> As I've stared at this data over the past year I've become aware that the
> order in which distros appear in the river is not necessarily the most
> useful for assessing the real-world impact of changes in blead. Put less
> charitably, the CPAN river can be "gamed."  It is possible for a person to
> release a large number of distributions which have dependencies on other
> distributions by the same author.  That can boost some of those
> distributions high up into the CPAN river -- into, say, the "top 1000" that
> I use in my monthly program.
>
> But if that author's distributions are not depended upon by *other*
> authors' distributions then they are arguably less important than those
> such as Module-Build and DateTime which are depended upon by vast numbers
> of distros written by people other than those distros' maintainers.
>
> Since "testing against blead" programs take hours to run, I would like to
> have that time spent focusing on what I consider to be more relevant
> distros.
>
> For the 5.29.* development cycle starting in May of this year, I would
> like to be able to use a ranking of CPAN distros which goes beyond asking:
>
> * "How many other distributions depend on this one?"
>
> ... to asking:
>
> * "How many distributions by other authors/maintainers depend on this one?"
>
> Would that be feasible?  Has anyone attempted this already?
>
> Thank you very much.
> Jim Keenan
>



-- 
David Golden  Twitter/IRC/GitHub: @xdg


Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread James E Keenan

On 02/02/2018 11:08 AM, H.Merijn Brand wrote:
> On Fri, 2 Feb 2018 15:51:32 +, Neil Bowers
>  wrote:
>
> > > For the 5.29.* development cycle starting in May of this year, I would like
> > > to be able to use a ranking of CPAN distros which goes beyond asking:
> > >
> > > * "How many other distributions depend on this one?"
> > >
> > > ... to asking:
> > >
> > > * "How many distributions by other authors/maintainers depend on this one?"
> > >
> > > Would that be feasible?  Has anyone attempted this already?
> >
> > When we were discussing the River model at QAH, and in discussions afterwards,
> > this came up. In the end we decided to keep things simple and go with the
> > current common definition. There are some tools in the CPAN ecosystem that only
> > count dependencies written by others.
> >
> > We’d need to agree which dists get ignored in this alternate scheme. Consider
> > this example:
> >
> > Here MARY has released a bunch of dists, but Foo-Bar is also relied
> > on by other dists written by MUNGO and MIDGE.
> >
> > The river count for Foo-Bar would be 2 here (ignoring the whole
> > branch that contains only dists from MARY), but the Foo river count
> > should be 3, I think. Foo-Bar “counts”, because it in turn is
> > depended on by dists from other authors. Otherwise the river count
> > would be 2 for both Foo and Foo-Bar. Basically we’re starting at the
> > “bottom” of the dependency graph, and trimming sub-graphs all from
> > one author.
> >
> > Also consider this example:
> >
> > What’s the river count of Plant — 0, 1, or 3? I think it should be 1,
> > in this alternate measure.
>
> 1 or 3: 1 if module chains from the same author are "compressed" to 1,
> 3 if not
>
> More interesting would be
>
>  Thing - Plant - Fruit - Banana - Silver Banana - Distasteful stuff
>  JOHN    PAUL    RINGO   RINGO    RINGO           GEORGE
>
> would plant now be 1, 2, or 4?
>
> > I.e. for sub-graphs by the same author, you only include the dist at
> > the head of the sub-graph.
>
> I'd suggest to have an option to squeeze any unbranched chain of
> modules from the same author to 1


I *think* that's what I'm aiming for.  Let's say I have a CPAN distro 
called Gamma on which nothing else depends.  I refactor code out of 
Gamma into Beta, such that Gamma now depends on Beta.  By the standard 
definition, Beta moves up-river, Gamma down-river.


Next I refactor code out of Beta into Alpha.  Alpha is now farther 
up-river than both Beta and Gamma.


Suppose that Alpha now falls into the "top 1000" of the CPAN river. 
I then switch Perl community roles and start to play the role of 
"rapid BBC evaluator."  A certain portion of my BBC program is now taken 
up with testing Alpha.  But, assuming I confine my focus to the top 
1000, that means some *other* CPAN distribution -- perhaps one whose 
revdeps are from different authors -- has been pushed out of the top 
1000.  That means the data I generate for P5P has been skewed toward 
myself.  That's what I'd like to avert.
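
The displacement effect described above can be illustrated with a toy ranking. This is a sketch under stated assumptions: the author IDs (JIM, ANN, BOB), the dists Lib-X and App-Y, and the function names are all hypothetical, and a cutoff of 1 stands in for the "top 1000".

```python
def dependents_of(dist, depends_on):
    """All dists that depend on `dist`, directly or indirectly."""
    found = set()
    frontier = [dist]
    while frontier:
        current = frontier.pop()
        for child, prereqs in depends_on.items():
            if current in prereqs and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def top_n(n, depends_on, author, other_authors_only=False):
    """Rank dists by dependent count (ties broken by name) and keep the first n."""
    def score(dist):
        deps = dependents_of(dist, depends_on)
        if other_authors_only:
            # Ignore dependents released by the same author as `dist`.
            deps = {d for d in deps if author[d] != author[dist]}
        return len(deps)

    ranked = sorted(author, key=lambda d: (-score(d), d))
    return ranked[:n]


# Hypothetical state after both refactorings: Gamma -> Beta -> Alpha, all by one author.
AUTHOR = {"Alpha": "JIM", "Beta": "JIM", "Gamma": "JIM", "Lib-X": "ANN", "App-Y": "BOB"}
DEPENDS_ON = {"Gamma": ["Beta"], "Beta": ["Alpha"], "App-Y": ["Lib-X"]}

print(top_n(1, DEPENDS_ON, AUTHOR))                           # ['Alpha']
print(top_n(1, DEPENDS_ON, AUTHOR, other_authors_only=True))  # ['Lib-X']
```

With the raw count, Alpha takes the single slot; once same-author dependents are ignored, the slot goes to Lib-X, the only dist that another author relies on.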




> > It would be useful to have both measures available: raw-river and
> > author-river.
> >
> > When looking at a dist there are (at least) three figures that might
> > be of interest: the full river count (total number of direct and
> > indirect dependencies), the author-filtered river count (as above),
> > and the number of direct dependencies (which could be split in 2 as
> > well).
> >
> > Neil




Thank you very much.
Jim Keenan


Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread James E Keenan

On 02/02/2018 10:51 AM, Neil Bowers wrote:
> > For the 5.29.* development cycle starting in May of this year, I would
> > like to be able to use a ranking of CPAN distros which goes beyond asking:
> >
> > * "How many other distributions depend on this one?"
> >
> > ... to asking:
> >
> > * "How many distributions by other authors/maintainers depend on this
> > one?"
> >
> > Would that be feasible?  Has anyone attempted this already?
>
> When we were discussing the River model at QAH, and in discussions
> afterwards, this came up. In the end we decided to keep things simple
> and go with the current common definition. There are some tools in the
> CPAN ecosystem that only count dependencies written by others.

Can you point us toward those tools?

> We’d need to agree which dists get ignored in this alternate scheme.

Please note that I'm not looking to replace the current definition.  I'm 
looking to develop supplementary definition(s) -- and their 
implementations -- that can be useful in particular circumstances.

> Consider this example:
>
> Here MARY has released a bunch of dists, but Foo-Bar is also relied on
> by other dists written by MUNGO and MIDGE.
>
> The river count for Foo-Bar would be 2 here (ignoring the whole branch
> that contains only dists from MARY), but the Foo river count should be
> 3, I think. Foo-Bar “counts”, because it in turn is depended on by dists
> from other authors. Otherwise the river count would be 2 for both Foo
> and Foo-Bar. Basically we’re starting at the “bottom” of the dependency
> graph, and trimming sub-graphs all from one author.
>
> Also consider this example:
>
> What’s the river count of Plant — 0, 1, or 3? I think it should be 1, in
> this alternate measure.
>
> I.e. for sub-graphs by the same author, you only include the dist at the
> head of the sub-graph.
>
> It would be useful to have both measures available: raw-river and
> author-river.
>
> When looking at a dist there are (at least) three figures that might be
> of interest: the full river count (total number of direct and indirect
> dependencies), the author-filtered river count (as above), and the
> number of direct dependencies (which could be split in 2 as well).
>
> Neil



Thank you very much.
Jim Keenan


Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread H.Merijn Brand
On Fri, 2 Feb 2018 15:51:32 +, Neil Bowers
 wrote:

> > For the 5.29.* development cycle starting in May of this year, I would like 
> > to be able to use a ranking of CPAN distros which goes beyond asking:
> > 
> > * "How many other distributions depend on this one?"
> > 
> > ... to asking:
> > 
> > * "How many distributions by other authors/maintainers depend on this one?"
> > 
> > Would that be feasible?  Has anyone attempted this already?  
> 
> When we were discussing the River model at QAH, and in discussions 
> afterwards, this came up. In the end we decided to keep things simple and go 
> with the current common definition. There are some tools in the CPAN 
> ecosystem that only count dependencies written by others.
> 
> We’d need to agree which dists get ignored in this alternate scheme. Consider 
> this example:
> 
> 
> 
> Here MARY has released a bunch of dists, but Foo-Bar is also relied
> on by other dists written by MUNGO and MIDGE.
> 
> The river count for Foo-Bar would be 2 here (ignoring the whole
> branch that contains only dists from MARY), but the Foo river count
> should be 3, I think. Foo-Bar “counts”, because it in turn is
> depended on by dists from other authors. Otherwise the river count
> would be 2 for both Foo and Foo-Bar. Basically we’re starting at the
> “bottom” of the dependency graph, and trimming sub-graphs all from
> one author.


> Also consider this example:
>
> What’s the river count of Plant — 0, 1, or 3? I think it should be 1,
> in this alternate measure.

1 or 3: 1 if module chains from the same author are "compressed" to 1,
3 if not

More interesting would be

 Thing - Plant - Fruit - Banana - Silver Banana - Distasteful stuff
 JOHN    PAUL    RINGO   RINGO    RINGO           GEORGE

would plant now be 1, 2, or 4? 

> I.e. for sub-graphs by the same author, you only include the dist at
> the head of the sub-graph.

I'd suggest having an option to squeeze any unbranched chain of
modules from the same author to 1.
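
A minimal sketch of that squeeze option for a purely linear chain follows. It assumes the author row of the example reads JOHN, PAUL, RINGO, RINGO, RINGO, GEORGE, and the function name is hypothetical; branching graphs would need more than this.

```python
from itertools import groupby

# Upstream-to-downstream chain as (dist, author) pairs; the author split is an assumption.
CHAIN = [
    ("Thing", "JOHN"),
    ("Plant", "PAUL"),
    ("Fruit", "RINGO"),
    ("Banana", "RINGO"),
    ("Silver Banana", "RINGO"),
    ("Distasteful stuff", "GEORGE"),
]


def chain_counts(chain, dist):
    """Return (raw, squeezed) downstream counts for `dist` in a linear chain."""
    names = [name for name, _ in chain]
    downstream_authors = [author for _, author in chain[names.index(dist) + 1:]]
    raw = len(downstream_authors)
    # groupby merges each run of consecutive identical authors into one group.
    squeezed = sum(1 for _ in groupby(downstream_authors))
    return raw, squeezed


print(chain_counts(CHAIN, "Plant"))  # (4, 2)
```

For Plant this gives 4 without squeezing and 2 with it, since the three RINGO dists collapse into one.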

> It would be useful to have both measures available: raw-river and
> author-river.
> 
> When looking at a dist there are (at least) three figures that might
> be of interest: the full river count (total number of direct and
> indirect dependencies), the author-filtered river count (as above),
> and the number of direct dependencies (which could be split in 2 as
> well).
> 
> Neil

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/




Re: CPAN-river: can graph calculation be modified?

2018-02-02 Thread Neil Bowers
> For the 5.29.* development cycle starting in May of this year, I would like 
> to be able to use a ranking of CPAN distros which goes beyond asking:
> 
> * "How many other distributions depend on this one?"
> 
> ... to asking:
> 
> * "How many distributions by other authors/maintainers depend on this one?"
> 
> Would that be feasible?  Has anyone attempted this already?

When we were discussing the River model at QAH, and in discussions afterwards, 
this came up. In the end we decided to keep things simple and go with the 
current common definition. There are some tools in the CPAN ecosystem that only 
count dependencies written by others.

We’d need to agree which dists get ignored in this alternate scheme. Consider 
this example:



Here MARY has released a bunch of dists, but Foo-Bar is also relied on by other 
dists written by MUNGO and MIDGE.

The river count for Foo-Bar would be 2 here (ignoring the whole branch that 
contains only dists from MARY), but the Foo river count should be 3, I think. 
Foo-Bar “counts”, because it in turn is depended on by dists from other 
authors. Otherwise the river count would be 2 for both Foo and Foo-Bar. 
Basically we’re starting at the “bottom” of the dependency graph, and trimming 
sub-graphs all from one author.

Also consider this example:



What’s the river count of Plant — 0, 1, or 3? I think it should be 1, in this 
alternate measure.

I.e. for sub-graphs by the same author, you only include the dist at the head 
of the sub-graph.

It would be useful to have both measures available: raw-river and author-river.

When looking at a dist there are (at least) three figures that might be of 
interest: the full river count (total number of direct and indirect 
dependencies), the author-filtered river count (as above), and the number of 
direct dependencies (which could be split in 2 as well).
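
As a rough sketch, the figures listed above could be computed side by side like this. The graph, the dist names App-One and App-Two, and all function and field names are hypothetical, and the "other authors" figure uses the simplest reading of Jim's question (dependents whose author differs from the dist's own), not the sub-graph trimming rule described earlier.

```python
from typing import NamedTuple


class RiverFigures(NamedTuple):
    full: int           # all direct and indirect dependents
    other_authors: int  # dependents released by a different author
    direct: int         # direct dependents only
    direct_other: int   # direct dependents by a different author


def figures(dist, depends_on, author):
    """Compute the figures for one dist from a {dist: [prerequisites]} map."""
    # Invert "child depends on prereq" into "prereq is depended on by child".
    dependents = {d: set() for d in author}
    for child, prereqs in depends_on.items():
        for prereq in prereqs:
            dependents[prereq].add(child)

    seen, stack = set(), list(dependents[dist])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependents[current])

    def is_other(d):
        return author[d] != author[dist]

    return RiverFigures(
        full=len(seen),
        other_authors=sum(is_other(d) for d in seen),
        direct=len(dependents[dist]),
        direct_other=sum(is_other(d) for d in dependents[dist]),
    )


# Hypothetical graph, not the one from the missing diagram.
AUTHOR = {"Foo": "MARY", "Foo-Bar": "MARY", "Foo-Bar-Baz": "MARY",
          "App-One": "MUNGO", "App-Two": "MIDGE"}
DEPENDS_ON = {"Foo-Bar": ["Foo"], "Foo-Bar-Baz": ["Foo-Bar"],
              "App-One": ["Foo-Bar"], "App-Two": ["App-One"]}

print(figures("Foo", DEPENDS_ON, AUTHOR))
print(figures("Foo-Bar", DEPENDS_ON, AUTHOR))
```

Keeping all the figures in one record makes it easy to sort a dist list by whichever measure suits the task at hand.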

Neil