Re: Verifying dep-5

2016-05-30 Thread Johannes Schauer
Hi,

Quoting Jakub Wilk (2016-05-30 13:08:47)
> * Johannes Schauer , 2016-05-28, 10:04:
> >I was investigating this problem last year and as far as my research 
> >went, there is no tracing method in existence which reliably traces 
> >system calls in general, file system access or read/write operations 
> >while keeping track of the acting pid that is 100% reliable. The 
> >methods I found either were not transparent (and would thus break test 
> >suites) or suffered from race conditions where it was possible to 
> >register an operation but miss the pid the operation was carried out by 
> >or dropped operations if they occurred with a too-high frequency...
> 
> Have you tried systemtap?

yes, and it will drop events if they arrive too fast. There is no way to
completely prevent it from doing so. One can only increase queue and buffer
sizes and timeouts but that will never provide 100% reliability.
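
The point about buffer sizes can be illustrated with a toy model. This is
only a sketch of a generic fixed-size event buffer, not of systemtap's
actual implementation; the function name and all parameters are made up for
illustration. It assumes events arrive at a constant rate and are drained at
a constant, slower rate.

```python
def simulate(capacity, arrivals_per_tick, drained_per_tick, ticks):
    """Toy model of a fixed-size event buffer; returns (delivered, dropped)."""
    queued = delivered = dropped = 0
    for _ in range(ticks):
        # Producer side: events that do not fit into the buffer are lost.
        free = capacity - queued
        accepted = min(arrivals_per_tick, free)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        # Consumer side: the reader drains at a fixed rate.
        drained = min(queued, drained_per_tick)
        queued -= drained
        delivered += drained
    return delivered, dropped
```

Under sustained overload (say 5 arrivals against 4 drained per tick), a
larger capacity in this model only postpones the first drop; it does not
prevent it.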

cheers, josch




Re: Verifying dep-5

2016-05-30 Thread Jakub Wilk

* Johannes Schauer , 2016-05-28, 10:04:
> I was investigating this problem last year and as far as my research
> went, there is no tracing method in existence which reliably traces
> system calls in general, file system access or read/write operations
> while keeping track of the acting pid that is 100% reliable. The
> methods I found either were not transparent (and would thus break test
> suites) or suffered from race conditions where it was possible to
> register an operation but miss the pid the operation was carried out by
> or dropped operations if they occurred with a too-high frequency...


Have you tried systemtap?

Timo Juhani Lindfors wrote a PoC that tracks all execs:
http://lindi.iki.fi/lindi/structured-buildlogs/logs/hello-2.6-1_amd64.build
http://lindi.iki.fi/lindi/git/structured-buildlogs.git/

> Having such a reliable tracing method would give us the ability to
> reliably infer copyright information


As Paul noticed in another mail, system call tracing won't necessarily
help much.

> as well as generating structured build logs (knowing for each line in
> the build log the process (tree) that created it).


Consider the following pipeline:

$ (LC_ALL=C date | tail -c2; echo 6) | shuf | head -n1 | tee log
6

Which process created the log line? The technically correct answer is 
"tee"; but this answer is completely impractical.


--
Jakub Wilk



Re: Verifying dep-5

2016-05-30 Thread Johannes Schauer
Hi,

Quoting Nikolaus Rath (2016-05-29 22:11:58)
> Did you write down your findings in some more detail somewhere?

no, sorry.

> I'd be curious why e.g. a LD_PRELOAD based wrapper would not work for all
> important cases.

For me "all important cases" were "compilation of all debian source packages".
LD_PRELOAD based methods would not work for source packages which make use
of this mechanism already (for example during their tests). A prominent example
would be src:fakechroot itself.

> Or are we assuming that the application is actively trying to prevent this
> (and e.g. does system calls directly on its own)?

We are assuming that applications do things that they normally do during
package builds. Unfortunately that includes test cases which sometimes do
really weird things.

Using fakechroot or proot it would definitely be possible to set up such a
package building tracer that would work for 99% of the archive.

By building first without a tracer, then with proot (on Linux) and then with
fakechroot (should the build fail with proot) and by then using reproducible
builds we can even make sure that the tracer did not influence the build in any
way that produces different binary packages. If test suites cannot be executed
because of the tracer, they will probably fail.
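
The fallback order described above could be sketched roughly as follows.
This is a minimal sketch, not an existing tool: `build` is a hypothetical
callable that runs the package build under the named tracer (None meaning
untraced) and returns the resulting files as a {filename: bytes} mapping, or
raises RuntimeError when the build fails. It also assumes the package
already builds reproducibly without a tracer.

```python
import hashlib


def digest(artifacts):
    """Hash a {filename: bytes} mapping in a stable order."""
    h = hashlib.sha256()
    for name in sorted(artifacts):
        h.update(name.encode() + b"\0" + artifacts[name] + b"\0")
    return h.hexdigest()


def traced_build(build, tracers=("proot", "fakechroot")):
    """Return (tracer, artifacts) for the first tracer that is transparent.

    `build(tracer)` is a hypothetical callable, see the text above.
    Returns (None, None) if no tracer reproduces the untraced result.
    """
    reference = digest(build(None))  # untraced baseline
    for tracer in tracers:
        try:
            artifacts = build(tracer)
        except RuntimeError:
            continue  # e.g. the test suite broke under this tracer
        if digest(artifacts) == reference:
            return tracer, artifacts
    return None, None
```

Comparing against the untraced baseline is what lets the reproducible build
act as the check that the tracer did not change the binary packages.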

I did not follow up on this 99% solution because I'm usually much less
motivated if the solution is not 100% proper. And there were some tricky things
to solve, like what file format to make up to be able to store build logs and
operations on files while at the same time maintaining the process tree that
led to writing to the build log or general file descriptor operations. And
since this information grows large really quickly (a yaml based
representation I tested with easily reached several hundred megabytes) it
would be great if the information could be written to the output file directly
instead of being stored in memory, but this then has to work even with parallel
builds.  There is still a sticky note about all these things on my fridge but
oh if I only had more time... XD
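
One possible shape for such a format, purely as a sketch: one JSON object
per line, appended as events happen, so nothing has to be kept in memory.
The field names (pid, ppid, op, target) and both function names are made up
for illustration. The sketch assumes that a short line written with a single
write() to a file opened with O_APPEND is not interleaved with lines from
other writers; in practice that usually holds on local Linux file systems,
but it should be treated as an assumption rather than a guarantee.

```python
import json
import os


def record(path, pid, ppid, op, target):
    """Append one event to the log as a single JSON line."""
    line = json.dumps({"pid": pid, "ppid": ppid, "op": op, "target": target})
    # O_APPEND makes every write land at the current end of the file, which
    # is what lets several processes of a parallel build share one log.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, (line + "\n").encode())
    finally:
        os.close(fd)


def ancestry(path, pid):
    """Read the log line by line and return the chain pid, parent, ... ."""
    parent = {}
    with open(path) as log:
        for line in log:
            event = json.loads(line)
            parent[event["pid"]] = event["ppid"]
    chain = [pid]
    # Walk up until a pid has no recorded parent (or a loop is detected).
    while chain[-1] in parent and parent[chain[-1]] not in chain:
        chain.append(parent[chain[-1]])
    return chain
```

With such a log, the process (tree) behind any recorded operation can be
looked up afterwards by streaming the file once.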

Thanks!

cheers, josch




Re: Verifying dep-5

2016-05-29 Thread Nikolaus Rath
On May 28 2016, Johannes Schauer  wrote:
> Hi,
>
> Quoting Paul Wise (2016-05-28 06:45:44)
>> I think it would be interesting to automatically track how each file
>> in a binary package was created and which files they were derived
>> from. Then we could automatically generate proper copyright files for
>> binary packages. That is a hard project so...
>
> I was investigating this problem last year and as far as my research went,
> there is no tracing method in existence which reliably traces system calls in
> general, file system access or read/write operations while keeping track of
> the acting pid that is 100% reliable. The methods I found either were not
> transparent (and would thus break test suites) or suffered from race
> conditions where it was possible to register an operation but miss the pid
> the operation was carried out by or dropped operations if they occurred with
> a too-high frequency...

Did you write down your findings in some more detail somewhere? I'd be
curious why e.g. a LD_PRELOAD based wrapper would not work for all
important cases.

Or are we assuming that the application is actively trying to prevent
this (and e.g. does system calls directly on its own)?


Best,
Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«



Re: Verifying dep-5

2016-05-29 Thread Dmitry Bogatov
[2016-05-28 13:20] Stefano Zacchiroli 
> On Sat, May 28, 2016 at 02:18:51AM +0300, Dmitry Bogatov wrote:
> > But it seems we do not have tools to check it. Probably, we need some way
> > to mark licenses of whole binary packages. WDYT?
> 
> You're correct that we have no way to document the licenses of binaries.
> The Policy is currently only concerned with documenting licenses at the
> source (files) level.
>
> Note that having human-maintained documentation of the license of each
> binary we ship is not enough to properly do the checking you have in mind.
> Tracking licensing information across builds is actually an open
> research question on which various teams around the world are
> working, from various angles: formalizing dependencies across builds,
> dynamically tracking builds using syscall tapping, inspecting built
> binaries ex post, etc. There are prototypes of all these things around,
> but TTBOMK they are all very limited (e.g., restricted to a specific
> build system and/or a programming language) and as such by no means
> generic enough to scale to the size and diversity we have in Debian.

In my particular case, the issue is solved (the upstream maintainer agreed to
remove the GPL file, making the package plain BSD-3-clause). But to get an idea
of whether such an issue is worth a new field in d/control, it would be
interesting to take a look at all dep5 d/copyright files. Downloading every
source package in the archive is not an option, sure.
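
For such a survey, the per-file step could look roughly like this. It is a
minimal sketch that assumes the text of a machine-readable d/copyright file
is already at hand (how to obtain it without downloading the source packages
is left open). It reads only the first line of each License field and
returns license expressions verbatim; the function name and the sample data
are made up for illustration.

```python
def license_names(copyright_text):
    """Collect the short names found on `License:` lines of a DEP-5 file."""
    names = set()
    for line in copyright_text.splitlines():
        # Continuation lines start with whitespace, so they never match here.
        if line.lower().startswith("license:"):
            name = line.split(":", 1)[1].strip()
            if name:
                names.add(name)
    return names


# Hypothetical sample, loosely modelled on the case discussed in this thread.
SAMPLE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example

Files: *
Copyright: 2016 Example Author
License: BSD-3-clause

Files: src/Gpl.hs
Copyright: 2016 Another Author
License: GPL-2+
"""
```

Calling license_names(SAMPLE) yields the set of the two short names, which
is enough to count how many packages mix BSD-style and GPL files.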

-- 
Accept: text/plain, text/x-diff
Accept-Language: eo,en,ru
X-Keep-In-CC: yes
X-Web-Site: sinsekvu.github.io



Re: Verifying dep-5

2016-05-28 Thread Paul Wise
On Sat, May 28, 2016 at 4:04 PM, Johannes Schauer wrote:

> Having such a reliable tracing method would give us the ability to reliably
> infer copyright information as well as generating structured build logs
> (knowing for each line in the build log the process (tree) that created it).
>
> Both of these would also tremendously help debugging problems. For example,
> for fixing reproducible build problems, I was often puzzled which program
> actually created a file that I was interested in for a source package that I
> am not familiar with.

Thanks for these other use-cases, very interesting.

> Unfortunately though, there seems to be no way to reliably trace process
> execution and read/write/open/close system calls without either sometimes
> missing information or breaking builds...

I expect this would need some support from the kernel being run under.

OTOH I don't think a tracing mechanism is what is needed though, since
the kernel cannot know what the program is doing with each file being
read/written by the program. This sort of semantics (input, output
and code) is only known by the program that is doing the
transformations. Especially when you factor in shell scripts and other
things, the semantics get complicated. Kernel support would definitely
be useful though.

Perhaps we could have a brainstorm/BoF about this at a DebConf some time.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise



Re: Verifying dep-5

2016-05-28 Thread Stefano Zacchiroli
On Sat, May 28, 2016 at 02:18:51AM +0300, Dmitry Bogatov wrote:
> But it seems we do not have tools to check it. Probably, we need some way
> to mark licenses of whole binary packages. WDYT?

You're correct that we have no way to document the licenses of binaries.
The Policy is currently only concerned with documenting licenses at the
source (files) level.

Note that having human-maintained documentation of the license of each
binary we ship is not enough to properly do the checking you have in mind.
Tracking licensing information across builds is actually an open
research question on which various teams around the world are
working, from various angles: formalizing dependencies across builds,
dynamically tracking builds using syscall tapping, inspecting built
binaries ex post, etc. There are prototypes of all these things around,
but TTBOMK they are all very limited (e.g., restricted to a specific
build system and/or a programming language) and as such by no means
generic enough to scale to the size and diversity we have in Debian.

Cheers.
-- 
Stefano Zacchiroli  . . . . . . .  z...@upsilon.cc . . . . o . . . o . o
Maître de conférences . . . . . http://upsilon.cc/zack . . . o . . . o o
Former Debian Project Leader . . . . . @zacchiro . . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »




Re: Verifying dep-5

2016-05-28 Thread Johannes Schauer
Hi,

Quoting Paul Wise (2016-05-28 06:45:44)
> I think it would be interesting to automatically track how each file
> in a binary package was created and which files they were derived
> from. Then we could automatically generate proper copyright files for
> binary packages. That is a hard project so...

I was investigating this problem last year and as far as my research went,
there is no tracing method in existence which reliably traces system calls in
general, file system access or read/write operations while keeping track of the
acting pid that is 100% reliable. The methods I found either were not
transparent (and would thus break test suites) or suffered from race conditions
where it was possible to register an operation but miss the pid the operation
was carried out by or dropped operations if they occurred with a too-high
frequency...

Having such a reliable tracing method would give us the ability to reliably
infer copyright information as well as generating structured build logs
(knowing for each line in the build log the process (tree) that created it).

Both of these would also tremendously help debugging problems. For example, for
fixing reproducible build problems, I was often puzzled which program actually
created a file that I was interested in for a source package that I am not
familiar with.

Unfortunately though, there seems to be no way to reliably trace process
execution and read/write/open/close system calls without either sometimes
missing information or breaking builds...

cheers, josch




Re: Verifying dep-5

2016-05-28 Thread Jonas Smedegaard
Quoting Dmitry Bogatov (2016-05-28 07:47:31)
>
> [add debian-devel back to cc]
>
>> Regarding _declaring_ appropriate DEP5 hints, with machine-readable
>> DEP5 copyright format you can declare a license in the _header_
>> section to indicate the effective license caused by "infection" of
>> individual parts on the whole of the binary product.
>
> Almost sufficient, but not general enough.

I don't follow, but instead of elaborating further here, see below...


> Just an idea: a new field in the Package: stanza in d/control,
> `Effective-License', which specifies which terms you must comply with if
> you use this library. In my case, I would leave debian/copyright
> alone, and add `Effective-License: GPL-2+' to libghc-missingh-dev.
>
> And add a rule that Effective-License defaults to License in the header,
> which defaults to the most strict of the licenses of individual files.
> Add a tool that implements this rule. Hmm, it is complicated.
>
> Thoughts?
>
>> Also note that DEP5 format is only optional, so such automated
>> checks, even if/when existing, would not cover Debian as a whole.
>
> Are there no plans to push it into policy?

I guess further progress on the copyright format is driven by bug reports
against debian-policy.  Therefore I suggest you file a bug report if
you feel there is substance for change.

Since Policy generally reflects the reality of Debian rather than steering
changes to it, you might consider "backing" such a bug report by active use
of your proposed new field: the copyright format explicitly permits the use
of unofficial fields.

 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private




Re: Verifying dep-5

2016-05-27 Thread Dmitry Bogatov

[add debian-devel back to cc]

> Regarding _declaring_ appropriate DEP5 hints, with machine-readable DEP5
> copyright format you can declare a license in the _header_ section to
> indicate the effective license caused by "infection" of individual parts
> on the whole of the binary product.

Almost sufficient, but not general enough.

Just an idea: a new field in the Package: stanza in d/control,
`Effective-License', which specifies which terms you must comply with if
you use this library. In my case, I would leave debian/copyright alone,
and add `Effective-License: GPL-2+' to libghc-missingh-dev.

And add a rule that Effective-License defaults to License in the header,
which defaults to the most strict of the licenses of individual files.
Add a tool that implements this rule. Hmm, it is complicated.
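
The defaulting rule could be sketched like this. Note the loud assumption:
licenses do not form a simple linear order in reality, so the STRICTNESS
table below is a hypothetical ranking used only to make the rule concrete,
and the function and parameter names are made up as well.

```python
# Hypothetical strictness ranking, for illustration only. Real licenses are
# not totally ordered, which is part of why the rule is complicated.
STRICTNESS = {"BSD-3-clause": 0, "LGPL-2.1+": 1, "GPL-2+": 2}


def effective_license(file_licenses, header_license=None, control_field=None):
    """Resolve the effective license of a binary package.

    Order of precedence, following the proposed rule:
    1. an explicit Effective-License field in d/control,
    2. the License field in the header of d/copyright,
    3. the most strict license among the individual files.
    """
    if control_field:
        return control_field
    if header_license:
        return header_license
    return max(file_licenses, key=STRICTNESS.__getitem__)
```

For a package whose files are BSD-3-clause plus one GPL-2+ file and which
declares nothing else, this resolves to GPL-2+.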

Thoughts?

> Also note that DEP5 format is only optional, so such automated
> checks, even if/when existing, would not cover Debian as a whole.

Are there no plans to push it into policy?

-- 
Accept: text/plain, text/x-diff
Accept-Language: eo,en,ru
X-Keep-In-CC: yes
X-Web-Site: sinsekvu.github.io



Re: Verifying dep-5

2016-05-27 Thread Paul Wise
On Sat, May 28, 2016 at 7:18 AM, Dmitry Bogatov wrote:

> Do we have any tools to check for GPL violations? I mean, is there any
> tool to perform a rather crude check whether a package that contains
> non-copyleft source files depends on a binary package whose source package
> contains a GPL file?

non-copyleft licenses are generally GPL compatible, but I guess you
are thinking of BSD-4-clause and OpenSSL licenses here? There are
GPL-incompatible copyleft licenses too (like CDDL).

The adequate tool can perform some checking of license incompatibilities:

https://piuparts.debian.org/sid/incompatible_licenses_inadequate_issue.html
https://packages.debian.org/unstable/adequate

> Currently, I am working on an issue with haskell-missingh.  All
> code in this package is BSD-3-clause, but one file is GPL.  It would
> be wrong to mark all files as GPL, but the package as a whole is GPL, which
> should be propagated down the dependency tree. But it seems we do not
> have tools to check it. Probably, we need some way to mark licenses
> of whole binary packages. WDYT?

I think it would be interesting to automatically track how each file
in a binary package was created and which files they were derived
from. Then we could automatically generate proper copyright files for
binary packages. That is a hard project so...

The next best thing is to have a manually prepared copyright file for
the binary package that is different to the one for the source package
(see libicns for an example) but...

Right now we completely ignore what the correct copyright/license
situation is for binary packages and assume it is the same as for the
source package.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise



Verifying dep-5

2016-05-27 Thread Dmitry Bogatov

Hello!

Do we have any tools to check for GPL violations? I mean, is there any
tool to perform a rather crude check whether a package that contains
non-copyleft source files depends on a binary package whose source package
contains a GPL file?

Currently, I am working on an issue with haskell-missingh.  All
code in this package is BSD-3-clause, but one file is GPL.  It would
be wrong to mark all files as GPL, but the package as a whole is GPL, which
should be propagated down the dependency tree. But it seems we do not
have tools to check it. Probably, we need some way to mark licenses
of whole binary packages. WDYT?
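
To make the kind of crude check meant here concrete, a rough sketch follows.
It is not an existing tool. It assumes two mappings are already available:
`depends` (package name to the names it depends on) and `licenses` (package
name to the set of license short names found in its source). All names are
hypothetical, and a match is only a candidate for human review, not a
verdict.

```python
def has_gpl(license_names):
    """Crude test: any short name starting with "GPL" (so not LGPL)."""
    return any(name.startswith("GPL") for name in license_names)


def gpl_dependencies(package, depends, licenses):
    """Return the transitive dependencies of `package` whose source has GPL files."""
    found = []
    seen = set()
    stack = list(depends.get(package, ()))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # also protects against dependency cycles
        seen.add(dep)
        if has_gpl(licenses.get(dep, ())):
            found.append(dep)
        stack.extend(depends.get(dep, ()))
    return sorted(found)
```

A package with only non-copyleft files and a non-empty result from
gpl_dependencies() would be the case to look at more closely.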

-- 
Accept: text/plain, text/x-diff
Accept-Language: eo,en,ru
X-Keep-In-CC: yes
X-Web-Site: sinsekvu.github.io