Re: Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-22 Thread Nilesh Patra
Hello,

On 22 December 2021 8:06:57 pm IST, Lance Lin  wrote:
>> No, not really. autopkgtest has a `needs-internet` restriction, so you can 
>> access internet to get stuff. See here:
>>
>> https://people.debian.org/~eriberto/README.package-tests.html
>>
>> But yeah, this is usually better, since the server you fetch data from might 
>> choke someday, or might turn unresponsive or maybe block IPs if you do 
>> several `get` requests to it (which the CI machines would do) and so on, 
>> then that's a problem.
>
>Would it be acceptable to create salsa repos that hold the test data for 
>various medical packages (pique-data, hinge-data)? After ensuring that the 
>data sets are public domain with appropriate credit given, we could then 
>reference a fixed salsa repo. It would still require the 'needs-internet' 
>restriction but would ensure the data is available.

We had that discussion many months ago, and for several reasons I think it's a 
bad idea.
I've listed all the reasons here [1]; please consider giving it a read.

We eventually reached a consensus to embed test data, which I later added to 
our policy as well [2].

This solved our problem for test data up to a few MBs, which is fine for us.
But gigabyte-sized data is in nobody's interest, since it puts a high load on 
us as contributors and a high load on the CI machines as well.
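
To keep an eye on that budget, something like the following could be run over 
the embedded data before uploading. This is only a sketch: the 5 MB default is 
an assumption standing in for "a few MBs" (it is not a number taken from the 
policy), and the helper names are made up for illustration.

```python
import os


def tree_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so data living elsewhere is not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def within_budget(root, limit_mb=5):
    """True if the data under root fits in limit_mb mebibytes (assumed limit)."""
    return tree_size_bytes(root) <= limit_mb * 1024 * 1024
```

For example, `within_budget("debian/tests/data")` would check a hypothetical 
data directory in the packaging tree.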

In fact, if the size of what you're pulling/testing exceeds many gigabytes, an 
RC bug will be filed against the package. One prominent example that I remember 
is tiddit; take a look here [3].

[1]: https://lists.debian.org/debian-med/2020/09/msg00365.html
[2]: https://med-team.pages.debian.net/policy/#embedding-large-test-data
[3]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964101

Hope that helps clarify things a bit,
Nilesh



Re: Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-22 Thread Lance Lin
Nilesh, Pierre,

Thank you for the response.

> > Yes please, making efforts to write tests is definitely worth it. From my 
> > experience, you might contact upstream developers to ask them for 
> > meaningful commands requiring no more data than the ones that are in the 
> > source tree. Friendly upstreams usually 
>
> I would second that. If possible, ask upstream for sensible data size that is 
> manageable under a few MBs.

I will reach out to the upstreams to see if they have any smaller test cases.

> No, not really. autopkgtest has a `needs-internet` restriction, so you can 
> access internet to get stuff. See here:
>
> https://people.debian.org/~eriberto/README.package-tests.html
>
> But yeah, this is usually better, since the server you fetch data from might 
> choke someday, or might turn unresponsive or maybe block IPs if you do 
> several `get` requests to it (which the CI machines would do) and so on, then 
> that's a problem.

Would it be acceptable to create salsa repos that hold the test data for 
various medical packages (pique-data, hinge-data)? After ensuring that the data 
sets are public domain with appropriate credit given, we could then reference a 
fixed salsa repo. It would still require the 'needs-internet' restriction but 
would ensure the data is available.

Based on Tony's response in the thread, perhaps the data sets for this type of 
processing are large out of necessity? This is what led me to think of the 
above solution.

Lance Lin 
GPG Fingerprint:  8CAD 1250 8EE0 3A41 7223  03EC 7096 F91E D75D 028F



Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Tony Travis

On 21/12/2021 21:12, Steven Robbins wrote:
> On Tuesday, December 21, 2021 10:22:49 A.M. CST Nilesh Patra wrote:
>> On 12/21/21 9:00 PM, Pierre Gruet wrote:
>>> On 21/12/2021 14:33, Lance Lin wrote:
>>>> Debian Medical Team,
>>>>
>>>> I have started looking at adding autopkgtest suites for a variety of
>>>> packages. Two of the packages (hinge, pique) require very large data
>>>> sets to run their included examples.
>>>> The sizes are several GB.
>>
>> I would second that. If possible, ask upstream for sensible data size that
>> is manageable under a few MBs.
>
> I understand the motivation here -- it is unwieldy and unusual to have
> GB-sized test data.  Irrespective of what I write below, it is always nice
> to have a "small" smoke-test data set so I support asking upstream in that
> spirit.
>
> It may be the case that upstream is able to get the same code coverage out of
> a smaller test data set.  Or maybe they can get a reduced-but-still-useful
> coverage.
>
> But in the days of "big data", it might be the case that testing the software
> really requires a big dataset.  What are Debian's options for this?

Hi, Steve.

I'm the author of PIQUE. In fact, the dataset that I use to test PIQUE 
is small in comparison to the datasets that we normally use for GWAS, and 
I included a Makefile to download it rather than including it in the repo.


Bye,

  Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548    http://minke-informatics.co.uk
mob. +44(0)7985 078324    mailto:tony.tra...@minke-informatics.co.uk



Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Tony Travis

On 21/12/2021 13:33, Lance Lin wrote:
> Debian Medical Team,
>
> I have started looking at adding autopkgtest suites for a variety of packages. 
> Two of the packages (hinge, pique) require very large data sets to run their 
> included examples. The sizes are several GB. It also looks like they may be 
> graphical in nature.
>
> Is it permissible for autopkgtest suites to download such large amounts of data 
> from the internet? Or should they be included in the repo?
>
> If so, I am happy to continue to look at these packages to include some basic 
> level of testing.
>
> Thank you!
>
> As I don't have direct domain knowledge of the technology
> Lance Lin 
> GPG Fingerprint:  8CAD 1250 8EE0 3A41 7223  03EC 7096 F91E D75D 028F


Hi, Lance.

I'm the author of PIQUE and I included a "test" folder with a Makefile 
in it that downloads an example dataset. The data is not in the repo.


HTH,

  Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548    http://minke-informatics.co.uk
mob. +44(0)7985 078324    mailto:tony.tra...@minke-informatics.co.uk



Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Steven Robbins
On Tuesday, December 21, 2021 10:22:49 A.M. CST Nilesh Patra wrote:
> On 12/21/21 9:00 PM, Pierre Gruet wrote:
> > On 21/12/2021 14:33, Lance Lin wrote:
> >> Debian Medical Team,
> >> 
> >> I have started looking at adding autopkgtest suites for a variety of
> >> packages. Two of the packages (hinge, pique) require very large data
> >> sets to run their included examples.
> >> The sizes are several GB.

> I would second that. If possible, ask upstream for sensible data size that
> is manageable under a few MBs.

I understand the motivation here -- it is unwieldy and unusual to have 
GB-sized test data.  Irrespective of what I write below, it is always nice 
to have a "small" smoke-test data set so I support asking upstream in that 
spirit.

It may be the case that upstream is able to get the same code coverage out of 
a smaller test data set.  Or maybe they can get a reduced-but-still-useful 
coverage.

But in the days of "big data", it might be the case that testing the software 
really requires a big dataset.  What are Debian's options for this? 

-Steve





Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Nilesh Patra

On 12/21/21 9:00 PM, Pierre Gruet wrote:
> On 21/12/2021 14:33, Lance Lin wrote:
>> Debian Medical Team,
>>
>> I have started looking at adding autopkgtest suites for a variety of
>> packages. Two of the packages (hinge, pique) require very large data sets
>> to run their included examples. The sizes are several GB.

That's a bit of a problem. Very large datasets are usually hard to manage 
unless they are present in the source tree itself. Can you try some sort of 
compression algorithm on them (converting to .xz etc.) and see if you can get 
the size down to a few megs?
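
A quick way to find out is to compress one of the files and compare sizes. The 
sketch below uses Python's standard `lzma` module; the function name and the 
default of writing next to the input are assumptions made for illustration, and 
nothing here is specific to hinge or pique.

```python
import lzma
import shutil
from pathlib import Path


def compress_xz(src, dst=None, preset=9):
    """Compress src to dst (default: src + '.xz').

    Returns a (original_size, compressed_size) tuple in bytes.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".xz")
    # Stream the file through the compressor so it never sits fully in memory.
    with open(src, "rb") as fin, lzma.open(dst, "wb", preset=preset) as fout:
        shutil.copyfileobj(fin, fout)
    return src.stat().st_size, dst.stat().st_size
```

Inputs that are already stored in a compressed format will shrink very little, 
so this mostly helps with plain-text data.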


>> It also looks like they may be graphical in nature.

Yeah, however it is sometimes possible to manage graphical tests with xvfb, 
provided you don't need to explicitly click buttons in the UI.
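
As a rough illustration, a test driver could wrap the program in `xvfb-run` 
(shipped in Debian's xvfb package). The helper names, the screen geometry, and 
the timeout below are assumptions, and the command being wrapped is whatever 
the package actually provides.

```python
import shutil
import subprocess


def xvfb_command(cmd, screen="1024x768x24"):
    """Prefix cmd so that it runs inside a throwaway virtual X server."""
    # -a picks a free display number; -s passes arguments to the Xvfb server.
    return ["xvfb-run", "-a", "-s", f"-screen 0 {screen}"] + list(cmd)


def run_headless(cmd, timeout=300):
    """Run cmd under xvfb-run and return its exit status."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found; the test would need xvfb installed")
    return subprocess.run(xvfb_command(cmd), timeout=timeout).returncode
```

This only covers programs that can be driven from the command line and merely 
need a display to start.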


> Thanks for working on adding tests to the packages. For sure this is always 
> useful!

+1

>> Is it permissible for autopkgtest suites to download such large amounts of
>> data from the internet? Or should they be included in the repo?
>
> Autopkgtests must be able to run on a minimal system with only the unpacked 
> source tree, the dependencies of the binary packages, and no access to the 
> Internet. This implies that one cannot rely on downloading while running the 
> tests:

No, not really. autopkgtest has a `needs-internet` restriction, so you can 
access the internet to get stuff. See here:

https://people.debian.org/~eriberto/README.package-tests.html

> the data that you use have to be included in the installed binary packages or 
> to lie in the source tree.

But yeah, this is usually better, since the server you fetch data from might 
choke someday, turn unresponsive, or block IPs if you do several `get` requests 
to it (which the CI machines would do), and then that's a problem.
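
If a test does end up downloading (and so declares `needs-internet`), it helps 
to at least retry politely and verify what arrived. The following is a minimal 
sketch with hypothetical names; the retry count, delay, and timeout are 
arbitrary assumptions, and it reads the whole payload into memory, so it only 
suits small files.

```python
import hashlib
import time
import urllib.request


def fetch_verified(url, sha256, retries=3, delay=2.0,
                   opener=urllib.request.urlopen):
    """Download url, retrying on network errors, and verify its SHA-256."""
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url, timeout=60) as response:
                data = response.read()
        except OSError as exc:  # URLError and socket timeouts derive from OSError
            last_error = exc
            # Back off a little longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
            continue
        if hashlib.sha256(data).hexdigest() != sha256:
            raise ValueError(f"checksum mismatch for {url}")
        return data
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Pinning the checksum means a silently changed or truncated upstream file fails 
loudly instead of producing confusing test results.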
 

>> If so, I am happy to continue to look at these packages to include some
>> basic level of testing.
>
> Yes please, making efforts to write tests is definitely worth it. From my 
> experience, you might contact upstream developers to ask them for meaningful 
> commands requiring no more data than the ones that are in the source tree. 
> Friendly upstreams usually 

I would second that. If possible, ask upstream for a sensible data size that is 
manageable, under a few MBs.

Regards,
Nilesh





Re: autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Pierre Gruet

Hello Lance,

On 21/12/2021 14:33, Lance Lin wrote:
> Debian Medical Team,
>
> I have started looking at adding autopkgtest suites for a variety of packages. 
> Two of the packages (hinge, pique) require very large data sets to run their 
> included examples. The sizes are several GB. It also looks like they may be 
> graphical in nature.

Thanks for working on adding tests to the packages. For sure this is 
always useful!

> Is it permissible for autopkgtest suites to download such large amounts of data 
> from the internet? Or should they be included in the repo?

Autopkgtests must be able to run on a minimal system with only the 
unpacked source tree, the dependencies of the binary packages, and no 
access to the Internet. This implies that one cannot rely on downloading 
while running the tests: the data that you use have to be included in 
the installed binary packages or to lie in the source tree.

> If so, I am happy to continue to look at these packages to include some basic 
> level of testing.

Yes please, making efforts to write tests is definitely worth it. From 
my experience, you might contact upstream developers to ask them for 
meaningful commands requiring no more data than the ones that are in the 
source tree. Friendly upstreams usually 

> Thank you!
>
> As I don't have direct domain knowledge of the technology

Hope this is useful,

Best,

--
Pierre

> Lance Lin 
> GPG Fingerprint:  8CAD 1250 8EE0 3A41 7223  03EC 7096 F91E D75D 028F


P.S.: I am CC'ing explicitly as I don't know if you subscribed to the 
debian-med list. Next time I won't, assuming you did.




autopkgtest requiring large data sets (pique, hinge)

2021-12-21 Thread Lance Lin
Debian Medical Team,

I have started looking at adding autopkgtest suites for a variety of packages. 
Two of the packages (hinge, pique) require very large data sets to run their 
included examples. The sizes are several GB. It also looks like they may be 
graphical in nature. 


Is it permissible for autopkgtest suites to download such large amounts of data 
from the internet? Or should they be included in the repo?

If so, I am happy to continue to look at these packages to include some basic 
level of testing.

Thank you!

As I don't have direct domain knowledge of the technology
Lance Lin 
GPG Fingerprint:  8CAD 1250 8EE0 3A41 7223  03EC 7096 F91E D75D 028F
