Alexandre, this is not a direct reply regarding EasyBuild. Please watch this video of Lyndon White's talk at JuliaCon 2018: https://www.youtube.com/watch?v=kSlQpzccRaI&feature=youtu.be
The talk discusses Julia's DataDeps https://www.juliabloggers.com/datadeps-jl-repeatabled-data-setup-for-repeatable-science/ and I think it may resonate strongly with you.

On Sun, 16 Dec 2018 at 13:10, Åke Sandgren <[email protected]> wrote:
> For large data sets it doesn't quite make sense to use EB's setup.
> EB downloads the "source" into its own source repository and then
> unpacks it. But that downloaded "source" will, in this case, just waste
> space.
>
> It's probably better to just download and unpack the dataset on a file
> system that is large enough, throw away the downloaded "source", and
> then create a module by hand that defines some environment variable
> that users can use to point their code to the data.
>
> But there is nothing inherently wrong with using EB for this type of
> thing; we already have this for the (quite small) VASP data files, for
> instance.
>
> It just boils down to how much file space you have for EB's
> "sourcepath" file system.
>
> One could enhance EB so that a specific "source" file is removed from
> the "sourcepath" after installation, to avoid having large datasets
> occupy file space twice.
>
> On 12/16/18 10:36 AM, Strube, Alexandre wrote:
> > Hi EasyBuilders,
> >
> > recently I stumbled upon the task of downloading Google's Open Image
> > Dataset V4 [1]. It's a dataset created for training and validating
> > image recognition machine learning engines.
> >
> > Basically, everyone in this ML field downloads at least one such
> > dataset, for example ImageNet [2].
> >
> > There are many other datasets which are shared among pretty much
> > every scientist in a specific field; for another example, the
> > Copernicus datasets for earth sciences [3].
> >
> > In that sense, a dataset is a tool - it's no different from Boost or
> > NumPy or GROMACS.
> > Given that they are used the same way by everyone, and that they
> > tend to be massive ([1] is 19 TB, and it's on the small side), it
> > makes huge sense, administratively, to have them identical and
> > shared among all users.
> >
> > On the supercomputers of research institutions, that would mean
> > offering them as a module one can load.
> >
> > Does that make any sense to you? To have easyconfigs which "install"
> > (i.e. download and unpack) datasets in standard locations on the
> > system, reproducible across systems?
> >
> > What do you think?
> >
> > Thanks for the attention, and merry xmas :-)
> >
> > [1] https://storage.googleapis.com/openimages/web/index.html
> > [2] http://www.image-net.org
> > [3] https://www.copernicus.eu/en
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
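For what it's worth, Alexandre's idea maps fairly naturally onto an existing generic easyblock. Below is a minimal sketch of such an easyconfig, assuming the generic `Tarball` easyblock and a made-up dataset name and URL (everything named `SomeDataset` here is hypothetical). The `modextravars` line is what gives users the environment variable Åke mentions:

```python
# Hypothetical easyconfig sketch for "installing" a dataset with the
# generic Tarball easyblock; name, version and URLs are invented.
easyblock = 'Tarball'

name = 'SomeDataset'
version = '1.0'

homepage = 'https://example.org/somedataset'
description = "Shared reference dataset, unpacked once and exposed as a module."

# datasets are toolchain-independent
# (on older EasyBuild releases: {'name': 'dummy', 'version': 'dummy'})
toolchain = SYSTEM

source_urls = ['https://example.org/downloads/']
sources = ['%(namelower)s-%(version)s.tar.gz']

# the generated module sets this variable so users can point their
# code at the unpacked data, as suggested in the thread
modextravars = {'SOMEDATASET_ROOT': '%(installdir)s'}

moduleclass = 'data'
```

After building this, `module load SomeDataset` would set `$SOMEDATASET_ROOT` to the install directory; whether the large tarball left in `sourcepath` can then be discarded automatically is exactly the enhancement Åke suggests.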

