Re: [gentoo-user] almost free launch: an idea to lower build time, and rice, at the same time

2019-11-07 Thread Wols Lists
On 05/11/19 15:05, Mickaël Bucas wrote:
> I remember reading an article about a man trying to reproduce binary
> packages of a binary distribution and failing to do so, because there
> are so many parts involved. I've read later that distributions have
> done some work to have reproducible builds, but I'm not sure how
> successful they are, even when all choices are predefined.

It gets worse ... a major reason two consecutive compiles on the same
system don't produce identical binaries is that much of the output embeds
date stamps and similar build-time data.

Reproducible builds are coming along, but they have to identify and strip
out all the compile-time info that ends up in the binary. They're
coming because they're needed for security purposes.

Cheers,
Wol



Re: [gentoo-user] almost free launch: an idea to lower build time, and rice, at the same time

2019-11-05 Thread Caveman Al Toraboran
On Tuesday, November 5, 2019 7:05 PM, Mickaël Bucas  wrote:
> Hi Caveman
>
> The Portage tree contains a few binary packages prepared by Gentoo
> developers, like Firefox, Rust, LibreOffice...
> "ls -d /usr/portage/*/*-bin" shows about 90 packages prepared in this
> way, some of them because they are non-free like Oracle JDK
>
> This means that no changes to Gentoo are necessary to accomplish
> what you describe : compile the packages, write the ebuilds for the
> binary packages, publish ebuilds in an overlay.

Some Qt-related packages are really slow to compile, yet are still not listed.
A problem with this approach is that IMO it's too manual and doesn't react
dynamically to user changes.

IMO we can consider this an automated community-driven bin-host that uses
statistics in order to tell which packages are reliable.  In case of hardware
mismatches, I think we can find a binary that's compiled with the desired,
say, USE flags, but compiled for an older CPU model whose instruction set is
still supported by the newer, rarer CPU that one might be using.
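
A minimal sketch of how such a reliability figure could be computed, under
strong assumptions: builds with identical settings are taken to be
bit-for-bit reproducible, so uploads can be compared by hash, and the
"confidence" is simply the share of uploads that agree with the most common
hash.  The function name and the sample data are hypothetical, and a real
system would also need to handle non-reproducible builds and many fake
uploaders.

```python
import hashlib
from collections import Counter


def agreement_confidence(uploads):
    """Return (most common SHA-256 digest, share of uploads that match it).

    `uploads` is a list of bytes objects, one per submitted binary.
    Assumes reproducible builds; this is a naive vote share, not a
    rigorous statistical confidence.
    """
    if not uploads:
        return None, 0.0
    digests = [hashlib.sha256(blob).hexdigest() for blob in uploads]
    digest, votes = Counter(digests).most_common(1)[0]
    return digest, votes / len(digests)


# Hypothetical example: nine matching uploads and one outlier.
uploads = [b"good build"] * 9 + [b"tampered build"]
digest, share = agreement_confidence(uploads)
print(digest[:12], share)
```

A client could then mask any binary whose share falls below some chosen
threshold.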

> But the really short list above shows that it's a really complex task
> because of all dependencies and configurable elements in Gentoo. If
> you just have a look at the output of "emerge --info" you can imagine
> all the moving parts, like compiler versions and compile options,
> Bash, Perl, Python, Init system, USE flags (combinatorial), even human
> languages. And that is just the easily visible parts !

True, however a few points:

* If we look at that info from the perspective of individual packages, it
  has far fewer degrees of variation in practice.  E.g. if we look at the USE
  flags dimension, dev-qt/qtwebengine has 12 of them, so worst case for this
  aspect we get about:

    nchoosek(12,1) + nchoosek(12,2) + ... + nchoosek(12,12) = 4095

  possible combinations with those 12 flags.  But most people are probably
  only interested in 2 sets of USE flag configurations: one with ALSA and
  another with PA (PulseAudio).  So in practice, that 4095 is probably
  reduced to just 2 or 3 clusters of configurations.

* For hardware details, such as the exact CPU model and the features actually
  enabled by the compiler when using `-march=native`, I don't know the actual
  distribution in practice.  But is it not possible to give users the choice
  to simply pick a binary that was compiled for an older CPU that their own
  CPU is backwards compatible with?

  E.g. the system could offer the user the nearest match to his query (e.g.
  in selection of USE flags), presenting a binary compiled for an older
  x86-64 CPU model than his newer x86-64 CPU.

  This way, this could become simply an automated bin-host that blurs
  configurations as necessary, and forks variations of specific
  configurations as demand rises, all without needing dev time to package
  *-bin manually.
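
Both bullets can be sketched in a few lines of Python.  The first part
checks the count of non-empty USE flag subsets (adding the "all flags off"
case gives 2**12 = 4096).  The second part is a hypothetical selection rule
for the hardware fallback: a binary is usable if every CPU feature it
requires is present on the host, and among usable binaries we prefer the
one that needs the most features.  The feature names and candidate labels
are made up for illustration.

```python
from math import comb

# Non-empty subsets of 12 USE flags, matching the nchoosek sum.
N_FLAGS = 12
non_empty_subsets = sum(comb(N_FLAGS, k) for k in range(1, N_FLAGS + 1))
assert non_empty_subsets == 2**N_FLAGS - 1 == 4095


def pick_compatible(host_features, candidates):
    """Pick the most specialised binary that the host CPU can still run.

    `candidates` maps a label to the set of CPU features that binary
    requires.  Returns None if no candidate is usable.
    """
    usable = {name: req for name, req in candidates.items()
              if req <= host_features}  # subset test
    if not usable:
        return None
    return max(usable, key=lambda name: len(usable[name]))


# Hypothetical host and candidate binaries.
host = {"sse2", "sse4_2", "avx", "avx2"}
candidates = {
    "generic": {"sse2"},
    "older": {"sse2", "sse4_2", "avx"},
    "newer": {"sse2", "avx", "avx2", "avx512f"},
}
print(pick_compatible(host, candidates))
```

Here "newer" is rejected because the host lacks one of its required
features, and "older" is preferred over "generic" because it uses more of
what the host offers.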


> I remember reading an article about a man trying to reproduce binary
> packages of a binary distribution and failing to do so, because there
> are so many parts involved. I've read later that distributions have
> done some work to have reproducible builds, but I'm not sure how
> successful they are, even when all choices are predefined.
>
> Given that Gentoo has taken a whole different road by having more
> choices available to the user, I don't think the compilation results
> of one configuration would be easily used on another.

Is it possible to collect statistics of such configurations from Gentoo users?

I don't know what the outcome would be, but I think it's worth exploring.  E.g.
what if it turned out that there is not much diversity in our settings?  What
if we can find a few really popular clusters of USE, language, and license
flags?  As for hardware, what would be the newest CPU that mine is backwards
compatible with, and for which a binary has been compiled with enough
statistical confidence in its reliability?
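
As a rough sketch of what such statistics could look like, assuming each
user reports the set of USE flags enabled for one package: count identical
sets and express each as a share of all reports.  The flag names and the
numbers below are invented for illustration, not real survey data.

```python
from collections import Counter


def cluster_shares(reports):
    """Return [(sorted flag list, share of reports)], most popular first."""
    counts = Counter(frozenset(flags) for flags in reports)
    total = sum(counts.values())
    return [(sorted(cfg), n / total) for cfg, n in counts.most_common()]


# Hypothetical reports for a single package.
reports = (
    [{"alsa"}] * 6
    + [{"pulseaudio"}] * 3
    + [{"alsa", "pulseaudio", "widgets"}] * 1
)
shares = cluster_shares(reports)
for flags, share in shares:
    print(flags, share)
```

If a handful of sets cover most reports, those are the clusters worth
hosting binaries for.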


> To go even further, pushing your compiled packages to a public server
> may create a security risk by exposing many parts of your
> configuration that could be analyzed by malicious people.

Any example of such sensitive information that might be in the binaries?  Just
curious, as I don't know much about this.

I could be wrong, but so far my thought is that I don't think we get many bits
of entropy for our security by hiding our package lists, because I think an
adversary can probably already use statistics to predict common clusters of
package lists that we might use.

So I personally doubt that attackers would face much difficulty by not knowing
our packages, because our packages are probably already predictable since our
distribution of packages is not that diverse.
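
To make "bits of entropy" concrete, here is a small sketch using Shannon
entropy over a distribution of package-list clusters.  Both distributions
are hypothetical; I have no real data on how Gentoo package lists are
distributed.

```python
from math import log2


def entropy_bits(shares):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * log2(p) for p in shares if p > 0)


# Hypothetical: 1024 equally likely package lists vs. a few dominant ones.
uniform = [1 / 1024] * 1024
skewed = [0.5, 0.25, 0.125, 0.125]

print(entropy_bits(uniform))
print(entropy_bits(skewed))
```

The fewer and more dominant the clusters, the fewer bits an attacker has
to guess, which is the point being made above.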


> So far I don't see a really big advantage in building this kind of
> infrastructure compared to either a binary distribution or Gentoo with
> home compilation.

IMO the real value is that it will be some kind of an automated community-driven
bin-host that uses statistics to quantify the reliability of its bins, and to

Re: [gentoo-user] almost free launch: an idea to lower build time, and rice, at the same time

2019-11-05 Thread Mickaël Bucas
On Tue, 5 Nov 2019 at 01:02, Caveman Al Toraboran wrote:
>
>
> DISCLAIMER:  I am not claiming that this idea is new.  It is probably not new.
> -----------  Even though some of its details might be new for a Linux
>              distribution, it's all based on boring well-established bits of
>              known science.  But regardless of its newness, I think it's worth
>              sharing with the hope that it may re-kindle the fire in a nerd's
>              heart (or a group of nerds) so that they develop this for me (or
>              us).
>
>
>
> GOAL:
> -----
> Reduce compile time, still rice (e.g. fancy USE, make.conf, etc), and yet not
> increase dev overhead.
>
>
> CURRENT SITUATION:
> ------------------
> If you use *-bin packages, you cannot rice; to rice, you must compile on your
> own.
>
>
> THE APPROACH:
> -------------
> 1. Some nerd (or a group of nerds) makes (or make) a package, maybe call it
>    `almostfreelunch.ebuild`.
>
> 2. Say you want to compile qtwebengine.  You do:   `almostfreelunch -aqvDuNt
>    --backtrack=1000 qtwebengine`.
>
> 3. The app, `almostfreelunch`, will look up your build setup (e.g.  USE flags,
>    make.conf settings, etc) for all packages that you are about to build on
>    your system as you are about to install that qtwebengine.
>
> 4. The app will upload that info to a central server, which looks up the
>    popularity of certain configurations.  E.g. see the distribution of
>    compile-time configurations for a given package.  The central server will
>    then figure out things like, qtwebengine is commonly compiled for x86-64
>    with certain USE flags and other settings in make.conf.
>
> 5. If the server figures out that the package that `almostfreelunch` is about
>    to compile is popular enough with the specific build settings about to be
>    used, the server will reply to the app and tell it "hi, upload to me
>    your bins when cooked, plz".  But if the build setting is not popular
>    enough, it will reply "nothx".  This way, the central server will not end
>    up with too many undesired binaries with uncommon build-time settings.
>
> 6. The central server will also collect multiple binary packages from multiple
>    people who use `almostfreelunch` for the same packages and the same
>    build-time options.  I.e. multiple qtwebengine with identical build-time
>    settings (e.g.  same USE flags, make.conf, etc).
>
> 7. The central server will perform statistical analysis on all of the
>    uploaded binaries, of the same packages and the same claimed build-time
>    settings, to cross-check those binaries to obtain a statistical confidence
>    in identifying which of the binaries is the good one, and which ones are
>    outliers.  Outliers might exist because of users with buggy
>    compilers, or malicious users that intentionally try to inject malware/bugs
>    into their binaries.
>
> 8. Thanks to information theory, we will be able to figure out how much
>    redundancy is needed in order to numerically calculate a confidence value
>    that shows how trustworthy a given binary is.  E.g. if a package, with
>    specific build-time options, has a very large number of binary submissions
>    that are also extremely similar (i.e. only differ in trivial aspects due to
>    certain randomness in how compilers work), then the central server can
>    calculate a high confidence value for it.  Else, the confidence value
>    drops.
>
> 9. If a user invokes `almostfreelunch -aqvDuNt --backtrack=1000 qtwebengine`
>    and the central server finds that there is an already compiled
>    package with the same settings, then the server simply tells the user, and
>    shows him the confidence associated with the fitness of the binary (based
>    on calculations in steps (6) to (8)).  By default, bins with too-low
>    confidence values will be masked and proper colours will be used to
>    adequately scare the users away from low-confidence packages.
>
> 10. If at step (9) the user likes the confidence of the pre-compiled binary
>     package, the user can simply download the binary package, blazing fast,
>     with all the nice USE and make.conf flags that he has.  Else, the user is
>     free to compile his own version, and upload his own binary, to help the
>     server enhance its confidence as calculated in steps (6) to (8).
>
>
> NOTES:
> ------
> * The statistical analysis in step (5) can also consider the compile time of
>   packages.  So the minimum popularity required for a specific package build
>   is weighted by the total build time.  This way, slow-to-build packages
>   will end up getting a lower minimum popularity than small packages.
>   Choosing the sweet-spot trade-off is a matter of optimizing resources of
>   the central server.
>
> * The statistical analysis in steps (6) to (8) could also be further enhanced
>   by ranking individual users who upload the binaries.  Users, who upload 
>