On 09/19/2011 06:54 PM, Rob Weir wrote:
On Mon, Sep 19, 2011 at 12:35 PM, Dennis E. Hamilton
<dennis.hamil...@acm.org>  wrote:
Rob,

I read Marcus's suggestion as follows: since the code base is in a modularized folder structure, and the wiki can map folder structures and their status nicely, it is not necessary to manage this from a single table; instead, the tables can live at an appropriate granularity toward the leaves of the hierarchy (on the wiki).


Using the wiki for this might be useful for tracking the status of
modules we already know we need to replace.  Bugzilla would be another
way to track the status.

How do you want to use Bugzilla to track thousands of files?

But it is not really a sufficient solution.  Why?  Because it is not
tied to the code and is not reproducible.  How was the list of
components listed in the wiki generated?  Based on what script?  Where
is the script?  How do we know it is accurate and current?  How do we
know that integrating a CWS does not make that list become outdated?
How do we prove to ourselves that we did this right?  And how do we
record that proof as a record?  And how do we repeat this proof every
time we do a new release?

Questions upon questions, but not helpful ones. ;-)

A list of components of unknown derivation sitting on a community wiki
that anyone can edit is not really a suitable basis for an IP review.

Then restrict the write access.

The granularity we need to worry about is the file.  That is the
finest grain level of having a license header.  That is the unit of
tracking in SVN.  That is the unit that someone could have changed the
content in SVN.

Again, it is fine if someone wants to outline this at the module
level.  But that does not eliminate the requirement for us to do this
at the file level as well.

IMHO you haven't understood what I wanted to tell you.

Sure, it makes no sense to create a list of every file in SVN to see if the license is good or bad. So do it module by module. And when a module is marked as "done", then of course every file in that module has been checked. Otherwise it's not working.

And how do we make sure that there is no change when source is added/moved/improved? Simply Commit Then Review (CTR). A change to the license header at the top of a file should be noticeable, right? However, we also need to have trust in everybody's work.
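The "a license-header change should be noticeable" idea could be automated in a review hook. Here is a minimal, hypothetical sketch; the marker strings, the 30-line header window, and the function names are assumptions for illustration, not project policy:

```python
import re

# Hypothetical check: flag revisions that alter the license markers
# found in a file's header region. The marker list is illustrative.
LICENSE_MARKERS = re.compile(
    r"Licensed to the Apache Software Foundation|"
    r"GNU General Public License|Mozilla Public License",
    re.IGNORECASE,
)

def header_markers(text: str, window: int = 30) -> set:
    """Collect license markers appearing in the first `window` lines."""
    head = "\n".join(text.splitlines()[:window])
    return set(m.lower() for m in LICENSE_MARKERS.findall(head))

def license_header_changed(old: str, new: str) -> bool:
    """True if the set of header license markers differs between revisions."""
    return header_markers(old) != header_markers(new)
```

A reviewer (or a post-commit script) could run such a check on each changed file and raise the rare header changes for explicit discussion, while routine code edits pass silently.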

BTW:
What is your plan to track every file to make sure the license is OK?

Marcus



I can see some brittle cases, especially in the face of refactoring.  The use 
of the wiki might have to be an ephemeral activity that is handled this way 
entirely for our initial scrubbing.

Ideally, additional and sustained review would be in the SVN with the artifacts 
so reviewed, and coalesced somehow.  The use of SVN properties is interesting, 
but they are rather invisible and I have a question about what happens with 
them when a commit happens against the particular artifact.


Properties stick with the file, unless changed.  Think of the
svn:eol-style property.  It is not wiped out with a new revision of
the file.

It seems that there is some need to balance an immediate requirement and what 
would be sufficient for it versus what would assist us in the longer term.  It 
would be interesting to know what the additional-review work has become for 
other projects that have a substantial code base (e.g., SVN itself, httpd, 
...).  I have no idea.


The IP review needs to occur with every release.  So the work we do to
automate this, and make it data-driven, will repay itself with every
release.

I invite you to investigate what other projects do.  When you do I
think you will agree.

  - Dennis

-----Original Message-----
From: Rob Weir [mailto:robw...@apache.org]
Sent: Monday, September 19, 2011 07:47
To: ooo-dev@incubator.apache.org
Subject: Re: A systematic approach to IP review?

On Mon, Sep 19, 2011 at 8:13 AM, Marcus (OOo)<marcus.m...@wtnet.de>  wrote:
On 09/19/2011 01:59 PM, Rob Weir wrote:

2011/9/19 Jürgen Schmidt<jogischm...@googlemail.com>:

On Mon, Sep 19, 2011 at 2:27 AM, Rob Weir<robw...@apache.org>    wrote:

If you haven't looked at it closely, it is probably worth a few minutes
of your time to review our incubation status page, especially the
items under "Copyright" and "Verify Distribution Rights".  It lists
the things we need to do, including:

  -- Check and make sure that the papers that transfer rights to the
ASF have been received. It is only necessary to transfer rights for the
package, the core code, and any new code produced by the project.

-- Check and make sure that the files that have been donated have been
updated to reflect the new ASF copyright.

-- Check and make sure that for all code included with the
distribution that is not under the Apache license, we have the right
to combine with Apache-licensed code and redistribute.

-- Check and make sure that all source code distributed by the project
is covered by one or more of the following approved licenses: Apache,
BSD, Artistic, MIT/X, MIT/W3C, MPL 1.1, or something with essentially
the same terms.

Some of this is already going on, but it is hard to get a sense of who
is doing what and how much progress we have made.  I wonder if we can
agree to a more systematic approach?  This will make it easier to see
the progress we're making and it will also make it easier for others
to help.

Suggestions:

1) We need to get all files needed for the build into SVN.  Right now
there are some that are copied down from the OpenOffice.org website
during the build's bootstrap process.   Until we get the files all in
one place it is hard to get a comprehensive view of our dependencies.


do you mean to check the files under ext_sources into SVN and remove them
later on when we have cleaned up the code? Or do you mean to put them
somewhere on Apache Extras?
I would prefer to keep these binary files on Apache Extras if possible.



Why not just keep it in SVN?   Moving things to Apache-Extras does not
help us with the IP review.   In other words, if we have a dependency
on a OSS module that has an incompatible license, then moving that
module to Apache Extras does not make that dependency go away.  We
still need to understand the nature of the dependency: a build tool, a
dynamic runtime dependency, a statically linked library, an optional
extension, or a necessary core module.

If we find out, for example, that something in ext-sources is only
used as a build tool, and is not part of the release, then there is
nothing that prevents us from hosting it in SVN.   But if something is
a necessary library and it is under GPL, then this is a problem even
if we store it on Apache-Extras.




2) Continue the CWS integrations.  Along with 1) this ensures that all
the code we need for the release is in SVN.

3)  Files that Oracle included in their SGA need to have the Apache
license header inserted and the Sun/Oracle copyright migrated to the
NOTICE file.  Apache RAT (Release Audit Tool) [2] can be used to
automate parts of this.

4) Once the SGA files have the Apache headers, then we can make
regular use of RAT to report on files that are lacking an Apache
header.  Such files might be in one of the following categories:

a) Files that Oracle owns the copyright on and which should be
included in an amended SGA

b) Files that have a compatible OSS license which we are permitted to
use.  This might require that we add a mention of it to the NOTICE
file.

c) Files that have an incompatible OSS license.  These need to be
removed/replaced.

d) Files that have an OSS license that has not yet been
reviewed/categorized by Apache legal affairs.  In that case we need to
bring it to their attention.

e) (Hypothetically) files that are not under an OSS license at all.
E.g., a Microsoft header file.  These must be removed.

5) We should track the resolution of each file, and do this
publicly.  The audit trail is important.  Some ways we could do this
might be:

a) Track this in SVN properties.  So set ip:sga for the SGA files,
ip:mit for files that are MIT licensed, etc.  This should be reflected
in headers as well, but this is not always possible.  For example, we
might have binary files where we cannot add headers, or cases where
the OSS files do not have headers, but where we can prove their
provenance via other means.

b) Track this in a spreadsheet, one row per file.

c) Track this in a text log file checked into SVN

d) Track this in an annotated script that runs RAT, where the
annotations document the reason for cases where we tell it to ignore a
file or directory.

6) Iterate until we have a clean RAT report.

7) Goal should be for anyone today to be able to see what work remains
for IP clearance, as well as for someone 5 years from now to be able
to tell what we did.  Tracking this on the community wiki is probably
not good enough, since we've previously talked about dropping that
wiki and going to MWiki.
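The file-by-file scan in steps 4-6 could look roughly like the following sketch. This is not RAT itself, just an illustration of the same idea; the marker patterns, category names, and 4 KB header window are assumptions:

```python
import os
import re

# Illustrative license markers -> category names. A real audit would
# use Apache RAT's much larger pattern set.
PATTERNS = [
    ("apache", re.compile(r"Licensed to the Apache Software Foundation")),
    ("mit",    re.compile(r"Permission is hereby granted, free of charge")),
    ("gpl",    re.compile(r"GNU General Public License")),
]

def classify_file(path: str) -> str:
    """Bucket a single file by the first license marker found in its head."""
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            head = f.read(4096)  # license headers sit at the top of the file
    except OSError:
        return "unreadable"
    for name, pattern in PATTERNS:
        if pattern.search(head):
            return name
    return "unknown"  # no marker found: needs manual review

def scan_tree(root: str) -> dict:
    """Walk the whole tree and bucket every file by detected license."""
    report = {}
    for dirpath, _dirs, files in os.walk(root):
        for fn in files:
            path = os.path.join(dirpath, fn)
            report.setdefault(classify_file(path), []).append(path)
    return report
```

The point of the sketch is the granularity: it visits every file, so a single stray GPL header in an otherwise clean module lands in the `gpl` bucket instead of being averaged away at the module level.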


We talked about it, yes, but did we reach a final decision?

The migrated wiki is available under http://ooo-wiki.apache.org/wiki and
can be used. Do we want to continue with this wiki now? It's still not
clear to me at the moment.

But we need a place to document the IP clearance and under
http://ooo-wiki.apache.org/wiki/ApacheMigration we have already some
information.


This is not really sufficient. The wiki is talking about module-level
dependencies.   This is a good start and useful for the high-level
discussion. But we need to look file-by-file.  We need to catch the
case where (hypothetically) there is a single GPL header file sitting
in a core OOo source directory.  So we need to review hundreds of
thousands of files.  Too big for a table on the wiki.

If you think in files, then yes, it's too big.

But when you split this up into the application modules, submodules, and
sub-sub-modules, then different people can work in parallel once it's known
who is working on which module.


We don't really have a comprehensive view of the licenses in the
source tree until we do a file-by-file scan.  Until we do that we just
have an approximation.

But once we have a detailed view, then it is natural to work on the
larger chunks module-by-module.  Most files we need to worry about
will be in a module where we will treat all files in that module the
same way.  But until proven otherwise, we need to be alert to the
possibility that there is a single non-OSS Microsoft header file
sitting in a directory someplace.  I'm not saying this has actually
happened, or that it is likely to have happened.  I'm just saying that
our review needs to be detailed enough that we can catch such a
problem if it occurs.


IMHO this should work, and there would always be an up-to-date overview.

Marcus



Note also that doing this kind of check is a prerequisite for every
release we do at Apache.  So agreeing on what tools and techniques we
want to use for this process is important.  If we do it right, the
next time we do a review it will be very fast and easy, since we'll be
able to build upon the review we've already done. That's why I think
that either using svn properties or scripts with annotated data files
listing "cleared" files is the best approach.  Make the review process
be data-driven and reproducible using automated tools.  It won't
totally eliminate the need for manual inspection, but it will: 1) Help
parallelize that effort, and 2) Ensure it is only done once per file.
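The "annotated data files listing cleared files" approach could be as simple as a checked-in list diffed against the tree on every release. A minimal sketch, assuming a tab-separated file format and the name CLEARED.tsv (both hypothetical):

```python
import os

def load_cleared(path: str) -> dict:
    """Read a checked-in cleared-files list: one 'relpath<TAB>status' per line.
    Blank lines and '#' comments carry the audit annotations."""
    cleared = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            rel, status = line.split("\t", 1)
            cleared[rel] = status
    return cleared

def audit(root: str, cleared: dict) -> list:
    """Return files present in the working tree but absent from the
    cleared list -- i.e., files still needing IP review."""
    missing = []
    for dirpath, _dirs, files in os.walk(root):
        for fn in files:
            rel = os.path.relpath(os.path.join(dirpath, fn), root)
            if rel not in cleared:
                missing.append(rel)
    return sorted(missing)
```

Because the list lives in SVN next to the code, each review is reproducible, the audit trail survives in version history, and a later release only has to clear the files added since the previous run.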

Juergen




-Rob


[1] http://incubator.apache.org/projects/openofficeorg.html

[2] http://incubator.apache.org/rat/
