For large data sets it doesn't quite make sense to use EB's setup. EB downloads the "source" into its own source repository and then unpacks it. But that downloaded "source" will, in this case, just waste space. It's probably better to just download and unpack the dataset on a file system that is large enough, throw away the downloaded "source", and then create a module by hand that defines an environment variable that users can point their code to.
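As a rough illustration of that manual route, here is a minimal Python sketch. It assumes the dataset has already been downloaded as a tarball from a trusted location and that the site uses Tcl modulefiles. The function names, the variable name DEMO_DATASET_ROOT and all paths are made up for the example and are not part of EB.

```python
import os
import tarfile


def unpack_and_discard(archive_path, dataset_dir):
    """Unpack a downloaded tarball into dataset_dir, then delete the tarball.

    Only meant for archives from a trusted source.
    """
    os.makedirs(dataset_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(dataset_dir)
    # The downloaded "source" only wastes space once it has been unpacked.
    os.remove(archive_path)


def write_modulefile(module_path, env_var, dataset_dir):
    """Write a minimal Tcl modulefile that sets env_var to dataset_dir."""
    os.makedirs(os.path.dirname(module_path), exist_ok=True)
    text = "#%Module1.0\n" + 'setenv {} "{}"\n'.format(env_var, dataset_dir)
    with open(module_path, "w") as handle:
        handle.write(text)
    return text
```

After something like `unpack_and_discard("/big/fs/tmp/demo.tar.gz", "/big/fs/datasets/demo")` and `write_modulefile("/path/to/modules/demo/1.0", "DEMO_DATASET_ROOT", "/big/fs/datasets/demo")` (hypothetical paths), users would load the module and read the dataset location from $DEMO_DATASET_ROOT in their code.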
But there is nothing inherently wrong with using EB for this type of thing; we already do this for the (quite small) VASP data files, for instance. It just boils down to how much file space you have for EB's "sourcepath" file system. One could enhance EB so that a specific "source" file can be marked for removal from the "sourcepath" after installation, to avoid having large datasets occupy file space twice.

On 12/16/18 10:36 AM, Strube, Alexandre wrote:
> Hi EasyBuilders,
>
>
> recently I stumbled upon the task of downloading Google’s Open Image
> Dataset V4 [1]. It’s a dataset creating
> for training and validating image recognition machine learning engines.
>
> Basically, everyone on this ML field downloads at least one of such
> datasets, called Imagenet [2].
>
> There are many other datasets which are shared among pretty much every
> scientist of a specific field. For
> another example, the copernicus datasets for earth sciences [3].
>
> In that sense, a dataset is a tool - it’s not different from BOOST or
> NumPy or GROMACS.
>
> Given that they are used the same way by everyone, and they tend to be
> massive ([1] is 19tb, and it’s quite
> small), it makes a huge sense, in a administrative way, to have them
> identical and shared among all users.
>
> Which would mean, in supercomputers of research institutions, as a
> module one can load.
>
> Does that make any sense for you? To have easyconfigs which “install”
> (i.e. download and unpack) datasets
> in standard locations on the system, reproducible across systems?
>
> What do you think?
>
> Thanks for the attention, and merry xmas :-)
>
>
>
> [1] https://storage.googleapis.com/openimages/web/index.html
> [2] http://www.image-net.org
> [3] https://www.copernicus.eu/en
>

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

