Hi Tony,



On 21 July 2023 at 08:08, tony mancill wrote:
[...]
> I haven't uploaded yet because I am not yet sure how (or whether it is
> even necessary) to document the license and copyright of a few of the
> test resources.  In particular, these files:
> 
> Files: jwat-arc/src/test/resources/IAH-20080430204825-00000-blackbook.arc
>        jwat-arc/src/test/resources/IAH-20080430204825-00000-blackbook.arc.gz
>        jwat-gzip/src/test/resources/IAH-20080430204825-00000-blackbook.warc
>        jwat-gzip/src/test/resources/IAH-20080430204825-00000-blackbook.warc.gz
>        jwat-warc/src/test/resources/IAH-20080430204825-00000-blackbook.warc
>        jwat-warc/src/test/resources/IAH-20080430204825-00000-blackbook.warc.gz

This seems to be the original source of those files:
https://archive.org/download/ExampleArcAndWarcFiles/
(See also https://archive.org/details/ExampleArcAndWarcFiles)

> 
> For which the decopy [2] utility generates a very messy copyright
> entry that ends with:
> 
>     License: CC-BY-NC-SA-ND-3 or Expat or GPL or LGPL-2.1+
> 
> It's conceivable that these WARC [3] files contain copyrighted materials
> and that uploading them as components of the source package would be
> considered redistribution, but I am admittedly not well-versed enough in
> this area to say for sure without looking into the contents in more
> detail.

Opening IAH-20080430204825-00000-blackbook.warc in an editor reveals
that it contains a web crawl of archive.org (or part of it). It
includes many files of different formats and media types, some of which
carry their own license information. This is why decopy lists so many
different licenses.

There might be false positives, though. This is because the warc file
contains web pages listing details about other resources including
license information. At least some of those resources are not included
in the warc file themselves, so the license might actually not be
applicable to any material in the warc file.
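To get a rough overview of which Creative Commons licenses are mentioned
at all, something along these lines might help. It is only a sketch: it
assumes the licenses are referenced via creativecommons.org/licenses/
URLs, which may not hold for every mention, and the function name is made
up for illustration.

```shell
# Hypothetical helper: count the distinct Creative Commons license URLs
# mentioned anywhere in the given file.
# Assumption: licenses are referenced by their creativecommons.org URL.
list_cc_license_mentions() {
    # -a treats the binary WARC file as text, -o prints only the match
    grep -aoE 'creativecommons\.org/licenses/[a-z-]+/[0-9.]+' "$1" \
        | sort | uniq -c | sort -rn
}
```

Usage would be `list_cc_license_mentions IAH-20080430204825-00000-blackbook.warc`.
A hit only shows that a license is mentioned somewhere, not that the
licensed material itself is part of the file.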

Passing the term "-nd" to the editor's search function produces good
examples. The first occurrence appears on a site providing details about
some podcast which is licensed CC-BY-NC-ND. The podcast itself, however,
is not part of the warc file. Unfortunately, there are quite a few
matches of "-nd" that would need checking and I haven't worked out a
good approach to make this actually feasible. Here is one interesting
observation though:

$ grep -ae "^Content-Type:" IAH-20080430204825-00000-blackbook.warc \
    | cut -d' ' -f2 | sort | uniq
application/http;
application/warc-fields
application/x-javascript
application/x-shockwave-flash
image/gif
image/jpeg
image/png
text/anvl
text/css
text/dns
text/html
text/html;
text/plain
text/plain;
text/xml

In particular, the Content-Types one would expect for the resources
mentioned as being licensed under some CC-BY-ND license (the podcast,
for example) are missing from this list.
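A slightly refined variant of the Content-Type listing, as a sketch: it
strips carriage returns and the trailing ";" so that entries such as
"text/html" and "text/html;" collapse into one, and it counts how often
each type occurs. The function name is made up for illustration, and
like the plain grep it matches both WARC record headers and the HTTP
headers embedded in the records.

```shell
# Hypothetical helper: tally the Content-Type values found in a WARC file.
count_content_types() {
    grep -a '^Content-Type:' "$1" \
        | tr -d '\r' \
        | cut -d' ' -f2 \
        | sed 's/;$//' \
        | sort | uniq -c | sort -rn
}
```

Usage would be `count_content_types IAH-20080430204825-00000-blackbook.warc`.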

Since this is from archive.org, their terms of service apply:
https://archive.org/about/terms.php

This might turn out to be a bit too restrictive for DFSG, since it
includes this passage:
    Access to the Archive’s Collections is provided at no cost to you
    and is granted for scholarship and research purposes only.

It makes sense for them to take this rather defensive approach since
they provide a lot of content from different sources. On the other hand,
the warc file appears to be intentionally prepared for testing and demo
purposes and uploaded by someone at archive.org. That is why I had hoped
for a more permissive license, but could not find any indication of it.

> 
> It would be nice to be able to (a) use the files as-is so that we
> don't have to either (b) remove the files and disable tests, or (c)
> replace the files and rewrite the tests that access them. I
> spot-checked a few tests and they appear to expect to be able to
> locate specific contents in the archive, so (c) would be non-trivial
> and could result in the package being quite difficult to maintain over
> time, since any upstream changes to those tests would require updating
> the patch(es).
> 
> Let me know if you have any thoughts on this.  Otherwise, I will follow
> up once I have a chance to look through the test resources in more
> detail.

Contacting archive.org and asking for license clarification might be an
option. I am not sure whether I would hold my breath, but it seems to me
that removing the files in question might turn out to be the only
alternative. Then again, we could disable the test suite, after all, so
the build would not depend on the presence of those files.

What do you think?

Best wishes,

Elias

> 
> Thank you, tony
> 
> [1] https://salsa.debian.org/java-team/libjwat-java
> [2] https://tracker.debian.org/pkg/decopy
> [3] https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
> 
