Alexandre, this is not a direct reply regarding EasyBuild. Please watch this video of Lyndon White's talk at JuliaCon 2018: https://www.youtube.com/watch?v=kSlQpzccRaI&feature=youtu.be
The talk discusses Julia's DataDeps https://www.juliabloggers.com/datadeps-jl-repeatabled-data-setup-for-repeatable-science/ and I think it may resonate strongly with you.

On Sun, 16 Dec 2018 at 13:10, Åke Sandgren <[email protected]> wrote:
> For large data sets it doesn't quite make sense to use EB's setup.
> EB downloads the "source" into its own source repository and then
> unpacks it. But that downloaded "source" will, in this case, just waste
> space.
>
> It's probably better to just download and unpack the dataset on a file
> system that is large enough, throw away the downloaded "source", and
> then create a module by hand that defines some environment variable
> that users can use to point their code to the data.
>
> But there is nothing inherently wrong with using EB for this type of
> thing; we already have this for the (quite small) VASP data files, for
> instance.
>
> It just boils down to how much file space you have for EB's
> "sourcepath" file system.
>
> One could enhance EB so that a specific "source" file is removed from
> the "sourcepath" after installation, to avoid having large datasets
> occupy file space twice.
>
> On 12/16/18 10:36 AM, Strube, Alexandre wrote:
> > Hi EasyBuilders,
> >
> > recently I stumbled upon the task of downloading Google's Open Image
> > Dataset V4 [1]. It's a dataset created for training and validating
> > image recognition machine learning engines.
> >
> > Basically, everyone in this ML field downloads at least one such
> > dataset, for example ImageNet [2].
> >
> > There are many other datasets which are shared among pretty much
> > every scientist in a specific field; for another example, the
> > Copernicus datasets for earth sciences [3].
> >
> > In that sense, a dataset is a tool - it's no different from Boost or
> > NumPy or GROMACS.
> > Given that they are used the same way by everyone, and that they
> > tend to be massive ([1] is 19 TB, and it's on the small side), it
> > makes huge sense, administratively, to have them identical and
> > shared among all users.
> >
> > On the supercomputers of research institutions, that would mean
> > offering them as a module one can load.
> >
> > Does that make any sense to you? To have easyconfigs which "install"
> > (i.e. download and unpack) datasets in standard locations on the
> > system, reproducible across systems?
> >
> > What do you think?
> >
> > Thanks for the attention, and merry xmas :-)
> >
> > [1] https://storage.googleapis.com/openimages/web/index.html
> > [2] http://www.image-net.org
> > [3] https://www.copernicus.eu/en
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
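For what it's worth, Alexandre's idea maps fairly naturally onto an existing generic easyblock. Below is a minimal sketch of such an easyconfig, assuming the generic `Tarball` easyblock and a made-up dataset name and URL (everything named `SomeDataset` here is hypothetical). The `modextravars` line is what gives users the environment variable Åke mentions:

```python
# Hypothetical easyconfig sketch for "installing" a dataset with the
# generic Tarball easyblock; name, version and URLs are invented.
easyblock = 'Tarball'

name = 'SomeDataset'
version = '1.0'

homepage = 'https://example.org/somedataset'
description = "Shared reference dataset, unpacked once and exposed as a module."

# datasets are toolchain-independent
# (on older EasyBuild releases: {'name': 'dummy', 'version': 'dummy'})
toolchain = SYSTEM

source_urls = ['https://example.org/downloads/']
sources = ['%(namelower)s-%(version)s.tar.gz']

# the generated module sets this variable so users can point their
# code at the unpacked data, as suggested in the thread
modextravars = {'SOMEDATASET_ROOT': '%(installdir)s'}

moduleclass = 'data'
```

After building this, `module load SomeDataset` would set `$SOMEDATASET_ROOT` to the install directory; whether the large tarball left in `sourcepath` can then be discarded automatically is exactly the enhancement Åke suggests.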

