[Reproducible-builds] proposal: store information in one place instead of multiple ones

Johannes Schauer Tue, 28 Jul 2015 12:09:24 -0700

Hi,

here are several questions I have which, for me, boil down to information being
duplicated and stored in different locations, leading to possible confusion for
contributors and added work when adding new bugs and issues:


1. Why is the set of bts usertags different from the set of r-b issues? The bts
   usertags seem to be much broader. For example there is the usertag
   "timestamps" which matches many issues. It is currently impossible to get a
   machine readable mapping from bug number to the issue it fixes because the
   usertags are much too broad.

   A solution would be to ditch the current usertags and use the issue names
   instead. This would allow a one-to-one mapping between issue and bug number.

   What is the utility of the current usertags? Are they used for anything?

   Currently I'd rather say that it's confusing to have a package associated
   with two disjoint sets of tags: the usertags and the issues. Why is that
   useful?

   Another helpful thing would be if the bug subject line wasn't as generic
   (i.e. if it were not just copied verbatim from the template), but that's
   another problem.

2. Why does packages.yml store the bug number(s) for each package? This
   information can easily be retrieved from the bts and then will also not be
   outdated. packages.yml easily lags behind the actual bts information if not
   regularly updated by someone.

3. Why are the issues explained in issues.yml *and* in the wiki? There should
   be one canonical place to describe them because currently, any new issue
   that is identified requires editing multiple resources and then linking
   between the two. This not only requires more work when creating the issue,
   but when looking up issues it is also unclear which resource is the
   authoritative one and which one will give the desired information. Instead,
   the information should be stored in one place only.

So my proposal is:

1. Instead of using the current usertags "toolchain", "infrastructure",
   "timestamps" and so on, use issue names instead.

   Since each bug can have multiple usertags, the old tags could even be kept
   and the issue names be added in addition.

   Since packages.yml exists, much of this conversion could probably even be
   automated (except for packages with more than one bug open for them).

   Sometimes, reproducibility problems only affect a single package and in that
   case it would create too much overhead to create a new issue for it. But in
   that case, why not just create a dummy issue just for the purpose of
   associating this kind of bug with the reproducible builds team?

2. Do not add bug numbers to packages.yml. The bts already stores which source
   package has which bugs filed by the reproducible builds team.

3. Use the wiki only to describe issues and ditch issues.yml. The advantages
   are that the Debian wiki offers a much richer syntax and is also editable by
   everybody in Debian and not only the reproducible builds team.

4. After this is done, it is hard to say why the notes.git is useful in the
   first place. The content of issues.yml is described in the Debian wiki and
   the bug numbers are stored in the bts. One last task of packages.yml would
   probably be to store some tiny notes for packages for which there doesn't
   exist a bug. But I'd say to also move these notes into the bts. I think that
   filing a bug about a package's unreproducibility should be done even without
   having a fix for it. In fact many packages with such bugs exist simply
   because, at the time the bug was filed, jenkins did fewer checks than it
   does now, so the patch which is currently in the bts no longer makes the
   package fully reproducible. Furthermore, storing these notes in the bts
   might make the package maintainer aware of the issue and give them a chance
   to comment on these notes. I would say it gives maintainers more incentive
   to react to the issue themselves that way.
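
To illustrate how the conversion in point 1 could be automated, here is a
minimal Python sketch. It assumes packages.yml has already been parsed into a
dict whose entries carry "bugs" and "issues" lists (that layout, the package
names and the bug numbers below are assumptions for illustration only). It
emits "usertags NNN + issue" lines for the bts control interface, so the old
tags are kept, and it sets aside packages with more than one bug for manual
review. The user address is passed in by the caller rather than hard-coded.
It also assumes that a package's single bug covers all issues listed for it,
which would still need a human check.

```python
def usertag_commands(packages, bts_user):
    """Build bts control commands from a parsed packages.yml-like mapping.

    `packages` maps a source package name to a dict with optional "bugs"
    and "issues" lists (assumed layout). Returns (commands, skipped), where
    `skipped` lists packages with more than one bug: for those it is not
    clear which bug belongs to which issue.
    """
    commands = [f"user {bts_user}"]
    skipped = []
    for name in sorted(packages):
        entry = packages[name]
        bugs = entry.get("bugs", [])
        issues = entry.get("issues", [])
        if not bugs or not issues:
            continue  # nothing to tag
        if len(bugs) != 1:
            skipped.append(name)  # ambiguous, needs a human
            continue
        for issue in issues:
            # "+" adds the issue name and leaves existing usertags in place
            commands.append(f"usertags {bugs[0]} + {issue}")
    commands.append("thanks")
    return commands, skipped


if __name__ == "__main__":
    # Hypothetical example data, not taken from the real packages.yml.
    example = {
        "foo": {"bugs": [100001], "issues": ["timestamps_in_gzip_headers"]},
        "bar": {"bugs": [100002, 100003], "issues": ["random_order_in_foo"]},
    }
    commands, skipped = usertag_commands(example, "team@example.org")
    print("\n".join(commands))
    print("needs manual review:", skipped)
```

The output would still have to be reviewed before being mailed to the bts
control address.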

On IRC the following problems were raised:

 - "you cannot grep packages.yml if it's in the bts because query-ing it takes
   ages"

     * true, but then just cache it. In fact, that's what you are already doing
       semi-manually in the notes.git by running the clean-notes script over
       it. So instead of having to wait until somebody runs clean-notes and
       then does `git add packages.yml && git commit && git push` from time to
       time, how about automating this process and then publishing a fully
       machine generated packages.yml which can then be grepped?

 - "there is more information in packages.yml than in the bts"

    * true, and I think that's a bug. By having this information in
      packages.yml and through that on reproducible.d.n, you are not informing
      the maintainer of the package about the little info you just found while
      analyzing their package for reproducibility issues. Instead, I think you
      should file a bug and write the small information you gathered there. If
      you find out more later, reply to that bug. This way you actively engage
      with the maintainers who might then even feel more compelled to help you
      or are even made aware of the problem in the first place.

 - "it is harder to keep track of packages affected by toolchain issues with
   the bts"

    * yes, but the bug against the toolchain package is also way more
      impressive and way more likely to get attention if it is blocking 100
      other bugs. Those blocking bugs also don't need to be empty placeholder
      bugs. Many packages have more than one issue, and a generic "please make
      this package reproducible" bug could then be blocked by a bug in one or
      more toolchain packages.
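
The caching idea from the first point can be sketched in a few lines of
Python. This is only an illustration under assumptions: the two input
mappings (issue usertag to bug numbers, and bug number to source package)
stand in for whatever a periodic bts query would return, and the function
names are made up for this example. The sketch inverts them into a
per-package view that can be published as a machine-generated, greppable
file. JSON is used only because it ships with the standard library.

```python
import json


def build_cache(bugs_by_tag, package_of_bug):
    """Invert bts query results into a per-package summary.

    bugs_by_tag:    dict mapping an issue usertag to a list of bug numbers
    package_of_bug: dict mapping a bug number to its source package
    Bugs whose package is unknown are left out.
    """
    cache = {}
    for tag, bugs in bugs_by_tag.items():
        for bug in bugs:
            package = package_of_bug.get(bug)
            if package is None:
                continue
            entry = cache.setdefault(package, {"bugs": set(), "issues": set()})
            entry["bugs"].add(bug)
            entry["issues"].add(tag)
    # Sort everything so that regenerated files give stable diffs.
    return {
        package: {
            "bugs": sorted(entry["bugs"]),
            "issues": sorted(entry["issues"]),
        }
        for package, entry in sorted(cache.items())
    }


def dump_cache(cache):
    """Serialize the cache as indented JSON text."""
    return json.dumps(cache, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical query results, for illustration only.
    tags = {"issue_a": [1, 2], "issue_b": [2]}
    packages = {1: "foo", 2: "bar"}
    print(dump_cache(build_cache(tags, packages)))
```

Run from a cron job, something like this would replace the manual clean-notes
and commit cycle with a file that is regenerated from the bts.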

My basic idea is: please have *one* unified place to store information instead
of duplicating it in multiple places. Having multiple places requires more work
and makes it confusing for contributors where they should add their work or
which resource has precedence over the other. The Debian bts and the Debian
wiki can be and are being used by all of Debian and are thus available and
familiar to a much broader audience. By using them for your own work you are
making it easier for others to contribute or to learn from your findings. If
they are insufficient for your use case, work around that through their
machine readable interfaces (for example, keep an automatically updated cache
of information from the bts for quick retrieval) or improve the Debian tools
like the bts and wiki by making them faster or giving them the features you
want. People in this project spend tons and tons of hours on the
infrastructure that runs the project and I wonder what would happen if all
this amazing energy was spent on improving the tools that the rest of Debian
uses instead of re-inventing the wheel?

Finally, these are just my 2 cents and I'm just a random bystander who finds
reproducible builds awesome, does one or two contributions a month and mentors
akira a bit, so I want to point out that this is your time and of course you
are totally free to do with it whatever you like. What made me write this email
was the overhead that I saw when mentoring akira: she had to file a bug in the
bts *and* update packages.yml, and she had to update issues.yml *and* the
Debian wiki. I think switching context from one system to the other (git, bts,
wiki) takes quite some time and causes quite a bit of overhead which could be
avoided if information was not duplicated. So this is really just my 2 cents
and I really do not mean to tell you what to do but just wanted to state some
observations from the perspective of an outsider :)

Thanks!

cheers, josch


