Use guix to distribute data & reproducible (data) science

Amirouche Boubekki Fri, 09 Feb 2018 08:33:08 -0800

Héllo all,

tl;dr: Distribution of data and software seems similar.
       Data is more and more important in software and reproducible
       science. The data science ecosystem lacks resource sharing.
       I think guix can help.


Recently I stumbled upon the open data movement and its links with
data science.

To give a high-level overview, there are several (web) platforms
that allow administrations and companies to publish data and
_distribute_ it. Examples of such platforms are data.gouv.fr [1] and
various other platforms based on CKAN [2].

[1] https://www.data.gouv.fr/
[2] https://okfn.org/projects/

I have worked with data.gouv.fr in particular, and the data in that
repository is of rather poor quality, which makes it very difficult
to use.

The other side of open data and data-based software is that
some software provides its own mechanism to _distribute_
data or binary blobs called 'models' that are sometimes based on
libre data. Examples of such software are spacy [3], gensim [4],
nltk [5] and word2vec.

[3] https://spacy.io/
[4] https://en.wikipedia.org/wiki/Gensim
[5] http://www.nltk.org/

My last point is that it is commonly said that data wrangling,
i.e. cleaning and preparing data, is 80% of a data scientist's job.
It is required because data distributors don't do it right, because
they lack the manpower and the knowledge to do it right.

To summarize:

1) Some software and platforms distribute _data_ themselves in a
   "walled garden" way. It's not the role of software to distribute
   data, especially when that data can be reused in other contexts.

2) Models are binary blobs that you use in the hope that they do what
   they are supposed to do. How do you build the model? Is the model
   reproducible?

3) Data preparation gets redone all the time; let's share resources
   and do it once.

It seems to me that guix has all the required features to handle data
and model distribution.
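
To make this a bit more concrete, here is a minimal sketch (in Python,
not actual guix code) of one such feature: a package definition pins
its source by a cryptographic hash, and whatever gets fetched is
rejected unless it matches. The same check could apply to a dataset or
a model file. The function names and the sample data are hypothetical,
and guix records hashes in its own base32 encoding rather than the hex
used here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_dataset(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its hash matches the pinned one.

    Raises ValueError otherwise, so a silently modified dataset
    never reaches the code that consumes it.
    """
    actual = sha256_hex(data)
    if actual != expected_sha256:
        raise ValueError(
            f"hash mismatch: expected {expected_sha256}, got {actual}"
        )
    return data


# Hypothetical dataset contents, standing in for a downloaded file.
raw = b"city,population\nParis,2148000\n"

# The hash would normally be written once in the package definition.
pinned = sha256_hex(raw)

# Later fetches are only accepted if they match the pinned hash.
dataset = verify_dataset(raw, pinned)
```

With the hash recorded next to the recipe that cleans the data, anyone
can check that they are working from the same input.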

What do people think? Do we already use guix to distribute data and models?


Also, it seems like a good time to surf on the AI frenzy ;)
