Re: Re: autopkgtest requiring large data sets (pique, hinge)
Hello,

On 22 December 2021 8:06:57 pm IST, Lance Lin wrote:
>> No, not really. autopkgtest has a `needs-internet` restriction, so you can
>> access internet to get stuff. See here:
>>
>> https://people.debian.org/~eriberto/README.package-tests.html
>>
>> But yeah, this is usually better, since the server you fetch data from might
>> choke someday, or might turn unresponsive or maybe block IPs if you do
>> several `get` requests to it (which the CI machines would do) and so on,
>> then that's a problem.
>
>Would it be acceptable to create salsa repos that hold the test data for
>various medical packages (pique-data, hinge-data)? After ensuring that the
>data sets are public domain with appropriate credit given, we could then
>reference a fixed salsa repo. It would still require the 'needs-internet'
>restriction but would ensure the data is available.

We had that discussion many months ago, and for several reasons I think it's a bad idea. I've listed all the reasons here [1]; please consider giving it a read. We eventually reached a consensus to embed test data, which I later added to our policy as well [2].

This solved our problem for test data of up to a few MBs, which is fine for us. But gigabyte-sized data is in nobody's interest, since it puts a high load on us as contributors and on the CI machines as well. In fact, if the size of what you're pulling/testing exceeds many gigabytes, an RC bug will be filed against the package. One prominent example that I remember is tiddit; take a look here [3].

[1]: https://lists.debian.org/debian-med/2020/09/msg00365.html
[2]: https://med-team.pages.debian.net/policy/#embedding-large-test-data
[3]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964101

Hope that helps clarify things a bit,
Nilesh
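To judge whether a data set could be embedded at the "few MBs" scale mentioned above, one could measure it both raw and xz-compressed before committing it. The following is only a rough sketch using Python's standard `lzma` module; the 5 MiB budget and the function names are assumptions made for illustration, not figures or tools from the Debian Med policy.

```python
import lzma
from pathlib import Path

# Hypothetical budget: the thread only says "a few MBs", so 5 MiB is an assumption.
BUDGET_BYTES = 5 * 1024 * 1024


def xz_size(path, chunk=1 << 20):
    """Return the size in bytes that one file would have as an .xz stream."""
    compressor = lzma.LZMACompressor(preset=9)  # default container format is .xz
    total = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large inputs are never held in memory at once.
        for block in iter(lambda: handle.read(chunk), b""):
            total += len(compressor.compress(block))
    total += len(compressor.flush())
    return total


def raw_and_xz_size(data_dir):
    """Return (raw_bytes, xz_bytes) summed over all regular files in data_dir."""
    raw = 0
    packed = 0
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            raw += path.stat().st_size
            # Each file is compressed on its own, as if stored as <name>.xz.
            packed += xz_size(path)
    return raw, packed


def fits_budget(data_dir, budget=BUDGET_BYTES):
    """True if the xz-compressed data stays within the (assumed) budget."""
    _, packed = raw_and_xz_size(data_dir)
    return packed <= budget
```

Calling `raw_and_xz_size("debian/tests/data")` (a placeholder path) would give both totals for a candidate data directory. Data that is already compressed or effectively random will barely shrink, so a poor result here suggests asking upstream for a smaller data set instead.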
Re: Re: autopkgtest requiring large data sets (pique, hinge)
Nilesh, Pierre,

Thank you for the response.

> > Yes please, making efforts to write tests is definitely worth it. From my
> > experience, you might contact upstream developers to ask them for
> > meaningful commands requiring no more data than the ones that are in the
> > source tree. Friendly upstreams usually
>
> I would second that. If possible, ask upstream for sensible data size that is
> manageable under a few MBs.

I will reach out to the upstreams to see if they have any smaller test cases.

> No, not really. autopkgtest has a `needs-internet` restriction, so you can
> access internet to get stuff. See here:
>
> https://people.debian.org/~eriberto/README.package-tests.html
>
> But yeah, this is usually better, since the server you fetch data from might
> choke someday, or might turn unresponsive or maybe block IPs if you do
> several `get` requests to it (which the CI machines would do) and so on, then
> that's a problem.

Would it be acceptable to create salsa repos that hold the test data for various medical packages (pique-data, hinge-data)? After ensuring that the data sets are public domain with appropriate credit given, we could then reference a fixed salsa repo. It would still require the 'needs-internet' restriction but would ensure the data is available.

Based on Tony's response in the thread, perhaps the data sets for this type of processing are large out of necessity? This is what led me to think of the above solution.

Lance Lin
GPG Fingerprint: 8CAD 1250 8EE0 3A41 7223 03EC 7096 F91E D75D 028F
Re: autopkgtest requiring large data sets (pique, hinge)
On 21/12/2021 21:12, Steven Robbins wrote:
> On Tuesday, December 21, 2021 10:22:49 A.M. CST Nilesh Patra wrote:
>> On 12/21/21 9:00 PM, Pierre Gruet wrote:
>>> On 21/12/2021 14:33, Lance Lin wrote:
>>>> Debian Medical Team,
>>>>
>>>> I have started looking at adding autopkgtest suites for a variety of
>>>> packages. Two of the packages (hinge, pique) require very large data
>>>> sets to run their included examples.
>>>> The sizes are several GB.
>>
>> I would second that. If possible, ask upstream for sensible data size that
>> is manageable under a few MBs.
>
> I understand the motivation here -- it is unwieldy and unusual to have
> GB-sized test data. Irrespective of what I write below, it is always nice to
> have a "small" smoke-test data set so I support asking upstream in that
> spirit.
>
> It may be the case that upstream is able to get the same code coverage out
> of a smaller test data set. Or maybe they can get a reduced-but-still-useful
> coverage.
>
> But in the days of "big data", it might be the case that testing the
> software really requires a big dataset. What are Debian's options for this?

Hi, Steve.

I'm the author of PIQUE. In fact the dataset that I use to test PIQUE is small in comparison to the datasets that we normally use for GWAS, and I included a Makefile to download it rather than including it in the repo.

Bye,
Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548    http://minke-informatics.co.uk
mob. +44(0)7985 078324    mailto:tony.tra...@minke-informatics.co.uk
Re: autopkgtest requiring large data sets (pique, hinge)
On 21/12/2021 13:33, Lance Lin wrote:
> Debian Medical Team,
>
> I have started looking at adding autopkgtest suites for a variety of
> packages. Two of the packages (hinge, pique) require very large data sets to
> run their included examples. The sizes are several GB. It also looks like
> they may be graphical in nature.
>
> Is it permissible for autopkgtest suites to download such large amounts of
> data from the internet? Or should they be included in the repo?
>
> If so, I am happy to continue to look at these packages to include some
> basic level of testing.
>
> Thank you! As I don't have direct domain knowledge of the technology
>
> Lance Lin
> GPG Fingerprint: 8CAD 1250 8EE0 3A41 7223 03EC 7096 F91E D75D 028F

Hi, Lance.

I'm the author of PIQUE and I included a "test" folder with a Makefile in it that downloads an example dataset. The data is not in the repo.

HTH,
Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548    http://minke-informatics.co.uk
mob. +44(0)7985 078324    mailto:tony.tra...@minke-informatics.co.uk
Re: autopkgtest requiring large data sets (pique, hinge)
On Tuesday, December 21, 2021 10:22:49 A.M. CST Nilesh Patra wrote:
> On 12/21/21 9:00 PM, Pierre Gruet wrote:
> > On 21/12/2021 14:33, Lance Lin wrote:
> >> Debian Medical Team,
> >>
> >> I have started looking at adding autopkgtest suites for a variety of
> >> packages. Two of the packages (hinge, pique) require very large data
> >> sets to run their included examples.
> >> The sizes are several GB.
>
> I would second that. If possible, ask upstream for sensible data size that
> is manageable under a few MBs.

I understand the motivation here -- it is unwieldy and unusual to have GB-sized test data. Irrespective of what I write below, it is always nice to have a "small" smoke-test data set so I support asking upstream in that spirit.

It may be the case that upstream is able to get the same code coverage out of a smaller test data set. Or maybe they can get a reduced-but-still-useful coverage.

But in the days of "big data", it might be the case that testing the software really requires a big dataset. What are Debian's options for this?

-Steve
Re: autopkgtest requiring large data sets (pique, hinge)
On 12/21/21 9:00 PM, Pierre Gruet wrote:
> On 21/12/2021 14:33, Lance Lin wrote:
>> Debian Medical Team,
>>
>> I have started looking at adding autopkgtest suites for a variety of
>> packages. Two of the packages (hinge, pique) require very large data sets
>> to run their included examples. The sizes are several GB.

That's a bit of a problem. Usually very large datasets, unless present in the source tree itself, are hard to manage. Can you try some sort of compression on it (converting to .xz etc.) and see if you are able to bring the size down to a few megs?

>> It also looks like they may be graphical in nature.

Yeah, however sometimes it is possible to manage the graphical tests with xvfb, provided you don't need to explicitly click buttons on the UI.

> Thanks for working on adding tests to the packages. For sure this is always
> useful!

+1

>> Is it permissible for autopkgtest suites to download such large amounts of
>> data from the internet? Or should they be included in the repo?
>
> Autopkgtests must be able to run on a minimal system with only the unpacked
> source tree, the dependencies of the binary packages, and no access to the
> Internet. This implies that one cannot rely on downloading while running
> the tests:

No, not really. autopkgtest has a `needs-internet` restriction, so you can access internet to get stuff. See here:

https://people.debian.org/~eriberto/README.package-tests.html

> the data that you use have to be included in the installed binary packages
> or to lie in the source tree.

But yeah, this is usually better, since the server you fetch data from might choke someday, or might turn unresponsive or maybe block IPs if you do several `get` requests to it (which the CI machines would do) and so on, then that's a problem.

>> If so, I am happy to continue to look at these packages to include some
>> basic level of testing.
>
> Yes please, making efforts to write tests is definitely worth it.
> From my experience, you might contact upstream developers to ask them for
> meaningful commands requiring no more data than the ones that are in the
> source tree. Friendly upstreams usually

I would second that. If possible, ask upstream for sensible data size that is manageable under a few MBs.

Regards,
Nilesh
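As a rough illustration of the `needs-internet` restriction and the xvfb suggestion above, a `debian/tests/control` stanza might look like the one embedded in the sketch below. The test name `run-example` and the `Depends` line are placeholders, not taken from the pique or hinge packaging, and the small parser is only a hypothetical helper for checking which restrictions a stanza declares.

```python
# Hypothetical debian/tests/control stanza. "run-example" and the Depends line
# are placeholders chosen for this sketch, not from any real package.
CONTROL_STANZA = """\
Tests: run-example
Depends: @, xvfb
Restrictions: needs-internet, allow-stderr
"""


def parse_stanza(text):
    """Parse one 'Field: value' stanza into a dict keyed by lower-cased field name."""
    fields = {}
    last = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] in " \t" and last is not None:
            # Continuation line: append to the previous field's value.
            fields[last] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last = name.strip().lower()
        fields[last] = value.strip()
    return fields


def restrictions(text):
    """Return the set of restrictions declared in the stanza."""
    raw = parse_stanza(text).get("restrictions", "")
    # Accept both comma- and space-separated lists.
    return set(raw.replace(",", " ").split())
```

With such a stanza, the test script itself could wrap a graphical program in `xvfb-run` (shipped in the `xvfb` package) so that it runs without a real display, and `restrictions(CONTROL_STANZA)` would report whether the test has declared that it needs network access.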
Re: autopkgtest requiring large data sets (pique, hinge)
Hello Lance,

On 21/12/2021 14:33, Lance Lin wrote:
> Debian Medical Team,
>
> I have started looking at adding autopkgtest suites for a variety of
> packages. Two of the packages (hinge, pique) require very large data sets to
> run their included examples. The sizes are several GB. It also looks like
> they may be graphical in nature.

Thanks for working on adding tests to the packages. For sure this is always useful!

> Is it permissible for autopkgtest suites to download such large amounts of
> data from the internet? Or should they be included in the repo?

Autopkgtests must be able to run on a minimal system with only the unpacked source tree, the dependencies of the binary packages, and no access to the Internet. This implies that one cannot rely on downloading while running the tests: the data that you use have to be included in the installed binary packages or to lie in the source tree.

> If so, I am happy to continue to look at these packages to include some
> basic level of testing.

Yes please, making efforts to write tests is definitely worth it. From my experience, you might contact upstream developers to ask them for meaningful commands requiring no more data than the ones that are in the source tree. Friendly upstreams usually

> Thank you! As I don't have direct domain knowledge of the technology

Hope this is useful,
Best,

--
Pierre

> Lance Lin
> GPG Fingerprint: 8CAD 1250 8EE0 3A41 7223 03EC 7096 F91E D75D 028F

P.S.: I am CC'ing you explicitly as I don't know if you are subscribed to the debian-med list. Next time I won't, assuming you are.
autopkgtest requiring large data sets (pique, hinge)
Debian Medical Team,

I have started looking at adding autopkgtest suites for a variety of packages. Two of the packages (hinge, pique) require very large data sets to run their included examples. The sizes are several GB. It also looks like they may be graphical in nature.

Is it permissible for autopkgtest suites to download such large amounts of data from the internet? Or should they be included in the repo?

If so, I am happy to continue to look at these packages to include some basic level of testing.

Thank you! As I don't have direct domain knowledge of the technology

Lance Lin
GPG Fingerprint: 8CAD 1250 8EE0 3A41 7223 03EC 7096 F91E D75D 028F