Bug#1041488: Please review patches for initial upload

2023-08-09 Thread tony mancill
Hello Elias,

On Mon, Jul 24, 2023 at 10:41:02PM +0200, Elias Oltmanns wrote:
> On 21 July 2023 at 08:08, tony mancill wrote:
> [...]
> 
> Contacting archive.org and asking for license clarification might be an
> option. I am not sure whether I would hold my breath, but it seems to me
> that removing the files in question might turn out to be the only
> alternative. Then again, we could disable the test suite, after all, so
> the build would not depend on the presence of those files.

I intend to remove files for which we don't have clear licenses from the
DFSG repacked tarball and disable the tests that depend upon them.
Apologies for the delay here - it will be a few days yet before I have
the package ready for upload.
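
For illustration only, here is a rough Python sketch of what such a repack
amounts to. In Debian packaging this step is usually delegated to uscan via
a Files-Excluded field in debian/copyright rather than done by hand; the
EXCLUDE pattern and the function names below are assumptions made for the
example and are not taken from the actual packaging.

```python
import fnmatch
import io
import tarfile

# Hypothetical exclusion pattern; the real list would name the files
# discussed in this thread.
EXCLUDE = ["*/src/test/resources/IAH-*-blackbook.*"]


def is_excluded(name, patterns=EXCLUDE):
    """Return True if the member path matches any exclusion pattern."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


def repack(src, dst, patterns=EXCLUDE):
    """Copy the tar stream src to dst, dropping excluded members.

    src and dst are seekable binary file objects. The output is written
    gzip-compressed. Returns the list of dropped member names.
    """
    dropped = []
    with tarfile.open(fileobj=src, mode="r:*") as tin, \
            tarfile.open(fileobj=dst, mode="w:gz") as tout:
        for member in tin:
            if is_excluded(member.name, patterns):
                dropped.append(member.name)
                continue
            # Only regular files carry data; directories and links are
            # written as headers only.
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)
    return dropped
```

The tests that read the dropped files would still have to be disabled
separately, since the sketch only touches the tarball.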

Thank you,
tony




Bug#1041488: Please review patches for initial upload

2023-07-24 Thread Elias Oltmanns
Hi Tony,




On 21 July 2023 at 08:08, tony mancill wrote:
[...]
> I haven't uploaded yet because I am not yet sure how (or whether it is
> even necessary) to document the license and copyright of a few of the
> test resources.  In particular, these files:
> 
> Files: jwat-arc/src/test/resources/IAH-20080430204825-0-blackbook.arc
>    jwat-arc/src/test/resources/IAH-20080430204825-0-blackbook.arc.gz
>    jwat-gzip/src/test/resources/IAH-20080430204825-0-blackbook.warc
>    jwat-gzip/src/test/resources/IAH-20080430204825-0-blackbook.warc.gz
>    jwat-warc/src/test/resources/IAH-20080430204825-0-blackbook.warc
>    jwat-warc/src/test/resources/IAH-20080430204825-0-blackbook.warc.gz

This seems to be the original source of those files:
https://archive.org/download/ExampleArcAndWarcFiles/
(See also https://archive.org/details/ExampleArcAndWarcFiles)

> 
> For which the decopy [2] utility generates a very messy copyright
> entry that ends with:
> 
> License: CC-BY-NC-SA-ND-3 or Expat or GPL or LGPL-2.1+
> 
> It's conceivable that these WARC [3] files contain copyrighted materials
> and that uploading them as components of the source package would be
> considered redistribution, but I am admittedly not well-versed enough in
> this area to say for sure without looking into the contents in more
> detail.

Opening IAH-20080430204825-0-blackbook.warc in an editor reveals
that it contains a webcrawl of archive.org (or part of it). It does
include many files of different formats and media types partly carrying
their own license information. This is why decopy lists so many
different licenses.

There might be false positives, though. This is because the warc file
contains web pages listing details about other resources including
license information. At least some of those resources are not included
in the warc file themselves, so the license might actually not be
applicable to any material in the warc file.
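
One way to check whether a resource named on such a page is itself part
of the archive could look like the following Python sketch. It assumes
that the file records each captured URI in a WARC-Target-URI header
field; the function names are made up for this example. Like a plain
grep, it would also match header-like lines that happen to occur inside
a payload, so the result is only a first approximation.

```python
import re

# WARC-Target-URI is the named field that carries a record's original URI.
# Optional angle brackets around the URI are tolerated.
TARGET_URI = re.compile(rb"^WARC-Target-URI:\s*<?([^\s>]+)>?\s*$", re.MULTILINE)


def target_uris(warc_bytes):
    """Return the set of URIs that appear as WARC-Target-URI fields."""
    return {m.group(1).decode("latin-1") for m in TARGET_URI.finditer(warc_bytes)}


def missing_from_archive(warc_bytes, candidates):
    """Return the candidate URIs that have no record of their own."""
    present = target_uris(warc_bytes)
    return [uri for uri in candidates if uri not in present]
```

A URI that shows up in the result of missing_from_archive() is only
mentioned in the archive, not captured in it, so a license attached to
it may well be a false positive.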

Passing the term "-nd" to the editor's search function produces good
examples. The first occurrence appears on a site providing details about
some podcast which is licensed CC-BY-NC-ND. The podcast itself, however,
is not part of the warc file. Unfortunately, there are quite a few
matches of "-nd" that would need checking and I haven't worked out a
good approach to make this actually feasible. Here is one interesting
observation though:

$ grep -ae "^Content-Type:" IAH-20080430204825-0-blackbook.warc \
| cut -d' ' -f2 | sort | uniq
application/http;
application/warc-fields
application/x-javascript
application/x-shockwave-flash
image/gif
image/jpeg
image/png
text/anvl
text/css
text/dns
text/html
text/html;
text/plain
text/plain;
text/xml

In particular, many of the Content-Types one would expect for the
resources described as being licensed under some CC-BY-ND license are
missing from this list.
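
The same survey can be sketched in Python, this time with a count per
media type and with parameters such as "; charset=..." stripped, so that
entries like "text/html" and "text/html;" are merged. This is only an
approximation of the grep pipeline above, not a WARC parser, and the
function name is an assumption made for the example.

```python
from collections import Counter


def content_type_counts(lines):
    """Count media types on lines that start with 'Content-Type:'.

    lines is an iterable of bytes, e.g. a file opened in binary mode.
    Parameters after the first ';' are dropped and the media type is
    lower-cased before counting.
    """
    counts = Counter()
    prefix = b"Content-Type:"
    for line in lines:
        if not line.startswith(prefix):
            continue
        value = line[len(prefix):].split(b";", 1)[0].strip()
        if value:
            counts[value.decode("latin-1").lower()] += 1
    return counts
```

Passing the WARC file opened with mode "rb" to content_type_counts() and
printing the result of most_common() gives the list sorted by frequency,
which makes it easier to see which media types dominate.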

Since this is from archive.org, their terms of service apply:
https://archive.org/about/terms.php

This might turn out to be a bit too restrictive for the DFSG, since it
includes this passage:
Access to the Archive’s Collections is provided at no cost to you
and is granted for scholarship and research purposes only.

It makes sense for them to take this rather defensive approach since
they provide a lot of content from different sources. On the other hand,
the warc file appears to be intentionally prepared for testing and demo
purposes and uploaded by someone at archive.org. That is why I had hoped
for a more permissive license, but could not find any indication of it.

> 
> It would be nice to be able to (a) use the files as-is so that we
> don't have to either (b) remove the files and disable tests, or (c)
> replace the files and rewrite the tests that access them. I
> spot-checked a few tests and they appear to expect to be able to
> locate specific contents in the archive, so (c) would be non-trivial
> and could result in the package being quite difficult to maintain over
> time, since any upstream changes to those tests would require updating
> the patch(es).
> 
> Let me know if you have any thoughts on this.  Otherwise, I will follow
> up once I have a chance to look through the test resources in more
> detail.

Contacting archive.org and asking for license clarification might be an
option. I am not sure whether I would hold my breath, but it seems to me
that removing the files in question might turn out to be the only
alternative. Then again, we could disable the test suite, after all, so
the build would not depend on the presence of those files.

What do you think?

Best wishes,

Elias

> 
> Thank you, tony
> 
> [1] https://salsa.debian.org/java-team/libjwat-java
> [2] https://tracker.debian.org/pkg/decopy
> [3] https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
> 



Bug#1041488: Please review patches for initial upload

2023-07-21 Thread tony mancill
Hello Elias,

On Wed, Jul 19, 2023 at 11:49:46PM +0200, Elias Oltmanns wrote:
> Sorry for all those emails, but I have just realised that
> debian/README.source needed fixing. The reason is that I started out
> with the test suite disabled but have managed to get it running after
> all.
> 
> So, I have added another patch to the previous two and will append all
> three to this message.

I have created a git repository for the Debian packaging [1] and started
reviewing the package.  Everything looks good from the standpoint of
constructing the .deb.

I haven't uploaded yet because I am not yet sure how (or whether it is
even necessary) to document the license and copyright of a few of the
test resources.  In particular, these files:

Files: jwat-arc/src/test/resources/IAH-20080430204825-0-blackbook.arc
   jwat-arc/src/test/resources/IAH-20080430204825-0-blackbook.arc.gz
   jwat-gzip/src/test/resources/IAH-20080430204825-0-blackbook.warc
   jwat-gzip/src/test/resources/IAH-20080430204825-0-blackbook.warc.gz
   jwat-warc/src/test/resources/IAH-20080430204825-0-blackbook.warc
   jwat-warc/src/test/resources/IAH-20080430204825-0-blackbook.warc.gz

For which the decopy [2] utility generates a very messy copyright entry
that ends with:

License: CC-BY-NC-SA-ND-3 or Expat or GPL or LGPL-2.1+

It's conceivable that these WARC [3] files contain copyrighted materials
and that uploading them as components of the source package would be
considered redistribution, but I am admittedly not well-versed enough in
this area to say for sure without looking into the contents in more
detail.

It would be nice to be able to (a) use the files as-is so that we don't
have to either (b) remove the files and disable tests, or (c) replace
the files and rewrite the tests that access them.  I spot-checked a few
tests and they appear to expect to be able to locate specific contents
in the archive, so (c) would be non-trivial and could result in the
package being quite difficult to maintain over time, since any upstream
changes to those tests would require updating the patch(es).

Let me know if you have any thoughts on this.  Otherwise, I will follow
up once I have a chance to look through the test resources in more
detail.

Thank you,
tony

[1] https://salsa.debian.org/java-team/libjwat-java
[2] https://tracker.debian.org/pkg/decopy
[3] https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml




Bug#1041488: Please review patches for initial upload

2023-07-19 Thread Elias Oltmanns
Sorry for all those emails, but I have just realised that
debian/README.source needed fixing. The reason is that I started out
with the test suite disabled but have managed to get it running after
all.

So, I have added another patch to the previous two and will append all
three to this message.

Cheers,

Elias


On 19 July 2023 at 23:16, Elias Oltmanns wrote:
> Package: wnpp
> Followup-For: Bug #1041488
> X-Debbugs-Cc: oltma...@zib.de
> Control: tags -1 patch
> 
> Please have a look at the following patches. They might be suitable as
> an initial seed for a salsa repository for the new package. Please apply
> and simply pull in the upstream sources by means of uscan.
> 
> Thank you in advance for any support you can provide.
> 
> Cheers,
> 
> Elias
> 
> 
> 
From d7ccbe9e1bdbdb0eb25a61b33953e6d68c7e78cb Mon Sep 17 00:00:00 2001
From: Elias Oltmanns 
Date: Wed, 19 Jul 2023 19:39:57 +0200
Subject: [PATCH 1/3] Initial commit

Closes: #1041488
---
 debian/README.source | 13 +
 debian/control   | 33 +
 debian/copyright | 29 +
 debian/libjwat-java.poms | 35 +++
 debian/maven.ignoreRules | 14 ++
 debian/maven.rules   | 10 ++
 debian/rules |  4 
 debian/source/format |  1 +
 debian/watch |  2 ++
 9 files changed, 141 insertions(+)
 create mode 100644 debian/README.source
 create mode 100644 debian/control
 create mode 100644 debian/copyright
 create mode 100644 debian/libjwat-java.poms
 create mode 100644 debian/maven.ignoreRules
 create mode 100644 debian/maven.rules
 create mode 100755 debian/rules
 create mode 100644 debian/source/format
 create mode 100644 debian/watch

diff --git a/debian/README.source b/debian/README.source
new file mode 100644
index 000..d6f8170
--- /dev/null
+++ b/debian/README.source
@@ -0,0 +1,13 @@
+Information about libjwat-java
+--
+
+This package was debianized using the mh_make command
+from the maven-debian-helper package.
+
+The build system uses Maven but prevents it from downloading
+anything from the Internet, making the build compliant with
+the Debian policy.
+
+Running the test suite at build time has been disabled in
+debian/maven.properties. This is due to dependencies that have not
+been packaged for Debian.
diff --git a/debian/control b/debian/control
new file mode 100644
index 000..50828a5
--- /dev/null
+++ b/debian/control
@@ -0,0 +1,33 @@
+Source: libjwat-java
+Section: java
+Priority: optional
+Maintainer: Debian Java Maintainers 
+Build-Depends:
+ debhelper-compat (= 13),
+ default-jdk,
+ maven-debian-helper (>= 2.1),
+ junit4 (>= 4.13.2),
+Build-Depends-Indep:
+ libbcprov-java (>= 1.65),
+ libdoxia-java (>= 1.7),
+ libmaven-compiler-plugin-java (>= 3.10.1),
+ libmaven-javadoc-plugin-java (>= 3.4.1),
+ libmaven-site-plugin-java (>= 3.12.1),
+ libsurefire-java (>= 2.22.3),
+ libhamcrest-java,
+ libmockito-java,
+ libpowermock-java
+Standards-Version: 4.6.2
+Homepage: https://sbforge.org/display/JWAT/JWAT
+Rules-Requires-Root: no
+
+Package: libjwat-java
+Architecture: all
+Depends: ${misc:Depends}, ${maven:Depends}
+Suggests: ${maven:OptionalDepends}
+Multi-Arch: foreign
+Description: Java Web Archive Toolkit
+ A collection of libraries to use for reading, writing and validating ARC,
+ WARC and GZip files. Also includes various helper classes to help with
+ different types of input streams. Finally there are also classes to help
+ with HTTP, character encoding and other Internet related protocols.
diff --git a/debian/copyright b/debian/copyright
new file mode 100644
index 000..9fa40f9
--- /dev/null
+++ b/debian/copyright
@@ -0,0 +1,29 @@
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Upstream-Name: Java Web Archive Toolkit
+Upstream-Contact: https://sbforge.org/display/JWAT/JWAT
+Source: https://github.com/netarchivesuite/jwat
+
+Files: *
+Copyright:
+ 2011-2023, Det Kongelige Bibliotek/Royal Danish Library (https://www.kb.dk/)
+License: Apache-2.0
+
+Files: debian/*
+Copyright: 2023, Zuse Institute Berlin (https://ewig.zib.de/)
+License: Apache-2.0
+
+License: Apache-2.0
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ .
+ http://www.apache.org/licenses/LICENSE-2.0
+ .
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ .
+ On Debian systems, the full text of the Apache-2.0 license
+ can be found in the file '/usr/share/common-licenses/Apache-2.0'
diff --git a/debian/libjwat-java.poms b/debian/libjwat-java.poms
new file mode 

Bug#1041488: Please review patches for initial upload

2023-07-19 Thread Elias Oltmanns
Package: wnpp
Followup-For: Bug #1041488
X-Debbugs-Cc: oltma...@zib.de
Control: tags -1 patch

Please have a look at the following patches. They might be suitable as
an initial seed for a salsa repository for the new package. Please apply
and simply pull in the upstream sources by means of uscan.

Thank you in advance for any support you can provide.

Cheers,

Elias
From d7ccbe9e1bdbdb0eb25a61b33953e6d68c7e78cb Mon Sep 17 00:00:00 2001
From: Elias Oltmanns 
Date: Wed, 19 Jul 2023 19:39:57 +0200
Subject: [PATCH 1/2] Initial commit

Closes: #1041488
---
 debian/README.source | 13 +
 debian/control   | 33 +
 debian/copyright | 29 +
 debian/libjwat-java.poms | 35 +++
 debian/maven.ignoreRules | 14 ++
 debian/maven.rules   | 10 ++
 debian/rules |  4 
 debian/source/format |  1 +
 debian/watch |  2 ++
 9 files changed, 141 insertions(+)
 create mode 100644 debian/README.source
 create mode 100644 debian/control
 create mode 100644 debian/copyright
 create mode 100644 debian/libjwat-java.poms
 create mode 100644 debian/maven.ignoreRules
 create mode 100644 debian/maven.rules
 create mode 100755 debian/rules
 create mode 100644 debian/source/format
 create mode 100644 debian/watch

diff --git a/debian/README.source b/debian/README.source
new file mode 100644
index 000..d6f8170
--- /dev/null
+++ b/debian/README.source
@@ -0,0 +1,13 @@
+Information about libjwat-java
+--
+
+This package was debianized using the mh_make command
+from the maven-debian-helper package.
+
+The build system uses Maven but prevents it from downloading
+anything from the Internet, making the build compliant with
+the Debian policy.
+
+Running the test suite at build time has been disabled in
+debian/maven.properties. This is due to dependencies that have not
+been packaged for Debian.
diff --git a/debian/control b/debian/control
new file mode 100644
index 000..50828a5
--- /dev/null
+++ b/debian/control
@@ -0,0 +1,33 @@
+Source: libjwat-java
+Section: java
+Priority: optional
+Maintainer: Debian Java Maintainers 
+Build-Depends:
+ debhelper-compat (= 13),
+ default-jdk,
+ maven-debian-helper (>= 2.1),
+ junit4 (>= 4.13.2),
+Build-Depends-Indep:
+ libbcprov-java (>= 1.65),
+ libdoxia-java (>= 1.7),
+ libmaven-compiler-plugin-java (>= 3.10.1),
+ libmaven-javadoc-plugin-java (>= 3.4.1),
+ libmaven-site-plugin-java (>= 3.12.1),
+ libsurefire-java (>= 2.22.3),
+ libhamcrest-java,
+ libmockito-java,
+ libpowermock-java
+Standards-Version: 4.6.2
+Homepage: https://sbforge.org/display/JWAT/JWAT
+Rules-Requires-Root: no
+
+Package: libjwat-java
+Architecture: all
+Depends: ${misc:Depends}, ${maven:Depends}
+Suggests: ${maven:OptionalDepends}
+Multi-Arch: foreign
+Description: Java Web Archive Toolkit
+ A collection of libraries to use for reading, writing and validating ARC,
+ WARC and GZip files. Also includes various helper classes to help with
+ different types of input streams. Finally there are also classes to help
+ with HTTP, character encoding and other Internet related protocols.
diff --git a/debian/copyright b/debian/copyright
new file mode 100644
index 000..9fa40f9
--- /dev/null
+++ b/debian/copyright
@@ -0,0 +1,29 @@
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Upstream-Name: Java Web Archive Toolkit
+Upstream-Contact: https://sbforge.org/display/JWAT/JWAT
+Source: https://github.com/netarchivesuite/jwat
+
+Files: *
+Copyright:
+ 2011-2023, Det Kongelige Bibliotek/Royal Danish Library (https://www.kb.dk/)
+License: Apache-2.0
+
+Files: debian/*
+Copyright: 2023, Zuse Institute Berlin (https://ewig.zib.de/)
+License: Apache-2.0
+
+License: Apache-2.0
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ .
+ http://www.apache.org/licenses/LICENSE-2.0
+ .
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ .
+ On Debian systems, the full text of the Apache-2.0 license
+ can be found in the file '/usr/share/common-licenses/Apache-2.0'
diff --git a/debian/libjwat-java.poms b/debian/libjwat-java.poms
new file mode 100644
index 000..671e512
--- /dev/null
+++ b/debian/libjwat-java.poms
@@ -0,0 +1,35 @@
+# List of POM files for the package
+# Format of this file is:
+#  [option]*
+# where option can be:
+#   --ignore: ignore this POM and its artifact if any
+#   --ignore-pom: don't install the POM. To use on POM files that are created
+# temporarily for certain artifacts such as Javadoc jars. [mh_install,