Hi EasyBuilders,

recently I stumbled upon the task of downloading Google’s Open Images Dataset V4
[1]. It’s a dataset created
for training and validating image-recognition machine learning models.

Basically, everyone in this ML field downloads at least one such dataset,
the best known being ImageNet [2].

There are many other datasets that are shared by pretty much every
scientist in a given field; another
example is the Copernicus datasets for Earth sciences [3].

In that sense, a dataset is a tool - it’s no different from Boost, NumPy, or
GROMACS.

Given that they are used the same way by everyone, and they tend to be massive
([1] is 19 TB, and that’s on the
small side), it makes a lot of sense administratively to have a single
identical copy shared among all users.

On the supercomputers of research institutions, that would mean providing them
as a module one can load.

Does that make sense to you? To have easyconfigs which “install” (i.e.
download and unpack) datasets
in standard locations on the system, reproducibly across systems?
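As a rough sketch of what I have in mind, a dataset easyconfig could use the
generic Tarball easyblock, which just unpacks the sources into the install
directory. Note that the name, version, source URL, and checksum below are
purely illustrative, not the real download details of [1]:

```python
# Hypothetical easyconfig sketch: "installing" a dataset as a module.
# The generic Tarball easyblock simply unpacks the sources into the
# installation prefix; no build step is needed.
easyblock = 'Tarball'

name = 'OpenImages'   # illustrative name
version = 'V4'

homepage = 'https://storage.googleapis.com/openimages/web/index.html'
description = "Google's Open Images Dataset V4, shared system-wide as a module."

# datasets need no compiler, so the system toolchain suffices
# (on older EasyBuild versions: toolchain = {'name': 'dummy', 'version': 'dummy'})
toolchain = SYSTEM

# illustrative download location, not the real one
source_urls = ['https://example.org/openimages/']
sources = ['openimages-%(version)s.tar.gz']
# checksums = ['...']  # pinning a checksum is what makes this reproducible

moduleclass = 'data'
```

Loading the resulting module would then expose the dataset’s location to every
user through the usual `$EBROOT*` environment variable, so scripts can find it
the same way on any system that ships the same easyconfig.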

What do you think?

Thanks for your attention, and merry Xmas :-)



[1] https://storage.googleapis.com/openimages/web/index.html
[2] http://www.image-net.org
[3] https://www.copernicus.eu/en
