For large data sets it doesn't quite make sense to use EB's setup.
EB downloads the "source" into its own source repository and then
unpacks it. But that downloaded "source" will, in this case, just waste
space.
It's probably better to just download and unpack the dataset on a file
system that is large enough, throw away the downloaded "source" and then
create a module by hand that defines an environment variable that users
can point their code at.
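
A rough sketch of that manual route, in Python. Everything named here
is made up for illustration: the function install_dataset, the variable
name OPENIMAGES_DIR and the directory layout are assumptions, not
anything EB provides. It assumes the archive has already been downloaded
and that a plain Tcl modulefile with a single setenv line is enough.

```python
import os
import tarfile
import tempfile


def install_dataset(archive, dataset_dir, module_path, env_var):
    """Unpack archive into dataset_dir, delete the archive, write a modulefile.

    All names and paths are hypothetical; adapt them to the local site layout.
    """
    os.makedirs(dataset_dir, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(dataset_dir)
    # Throw away the downloaded "source" so the data is stored only once.
    os.remove(archive)

    # Minimal Tcl modulefile that only exports the dataset location.
    os.makedirs(os.path.dirname(module_path), exist_ok=True)
    with open(module_path, "w") as fh:
        fh.write("#%Module1.0\n")
        fh.write("setenv %s %s\n" % (env_var, dataset_dir))
    return module_path


def demo():
    """Run the steps on a tiny stand-in archive inside a temporary directory."""
    work = tempfile.mkdtemp()
    sample = os.path.join(work, "labels.csv")
    with open(sample, "w") as fh:
        fh.write("id,label\n")
    archive = os.path.join(work, "dataset.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(sample, arcname="labels.csv")

    dataset_dir = os.path.join(work, "datasets", "OpenImages", "4")
    module_path = os.path.join(work, "modules", "OpenImages", "4")
    install_dataset(archive, dataset_dir, module_path, "OPENIMAGES_DIR")
    return work, archive, dataset_dir, module_path


if __name__ == "__main__":
    _, _, _, module_path = demo()
    with open(module_path) as fh:
        print(fh.read())
```

Users would then load the module and read the variable (OPENIMAGES_DIR
in this sketch) in their own code instead of hard-coding a path.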

But there is nothing inherently wrong with using EB for this type of
thing. We already do this for the (quite small) VASP data files, for
instance.

It just boils down to how much file space you have for EB's "sourcepath"
file system.

One could enhance EB to say that a specific "source" file should be
removed from the "sourcepath" after installation to avoid having large
datasets occupy file space twice.

On 12/16/18 10:36 AM, Strube, Alexandre wrote:
> Hi EasyBuilders,
> 
> 
> recently I stumbled upon the task of downloading Google’s Open Image
> Dataset V4 [1]. It’s a dataset created
> for training and validating image recognition machine learning engines.
> 
> Basically, everyone in this ML field downloads at least one of these
> datasets, called Imagenet [2].
> 
> There are many other datasets which are shared among pretty much every
> scientist of a specific field. Another
> example is the Copernicus datasets for earth sciences [3].
> 
> In that sense, a dataset is a tool - it’s no different from BOOST or
> NumPy or GROMACS.
> 
> Given that they are used the same way by everyone, and they tend to be
> massive ([1] is 19 TB, and it’s quite
> small), it makes a lot of sense, from an administrative point of view,
> to have them identical and shared among all users.
> 
> On the supercomputers of research institutions, that would mean
> providing them as a module one can load.
> 
> Does that make any sense to you? To have easyconfigs which “install”
> (i.e. download and unpack) datasets
> in standard locations on the system, reproducible across systems?
> 
> What do you think?
> 
> Thanks for the attention, and merry xmas :-)
> 
> 
> 
> [1] https://storage.googleapis.com/openimages/web/index.html
> [2] http://www.image-net.org
> [3] https://www.copernicus.eu/en
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected]   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
