As far as I know, we are currently in "rolling your own" mode. If
the datasets you are using already have their own DOI or permalink (or
something close to one), such as those on http://rda.ucar.edu/ or ESGF,
the best approach is to write a "thin wrapper" that does the data
access. The wrapper downloads the data from the origin and caches it
locally for subsequent requests (and the wrapper itself is kept in
version control, like the rest of the pipeline).
I know this is suboptimal, but I think it is the best you can do at
the moment (and it assumes that at least one dataset fits on your
disk, which for climate datasets can be a generous assumption).
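
For concreteness, a minimal sketch of such a wrapper in Python (the
URL, file name, and cache directory below are placeholders, not a
real dataset or service) might look like:

    import os
    import urllib.request

    # The cached data stays out of version control; only this wrapper
    # script is versioned along with the rest of the pipeline.
    CACHE_DIR = "data/cache"

    def fetch(url, filename):
        """Return a local path to the dataset, downloading it on first use."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        local_path = os.path.join(CACHE_DIR, filename)
        if not os.path.exists(local_path):
            print("Downloading %s ..." % url)
            urllib.request.urlretrieve(url, local_path)
        return local_path

    if __name__ == "__main__":
        # Placeholder permalink; in practice this would be the dataset's
        # DOI/permalink resolved to a download URL on rda.ucar.edu, ESGF, etc.
        path = fetch("https://example.org/permalink/dataset.nc", "dataset.nc")
        print("Using cached copy at", path)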

I am involved in something that could "ease the pain", and would even
work if your local disks cannot hold the whole datasets, but that is
still at the write-the-proposal-for-the-funding-agency stage.

On Thu, Mar 3, 2016 at 8:49 AM, Matthew Gidden <[email protected]> wrote:
> Hi all,
>
> A few months ago I started a thread on the Data Carpentry forums about best
> practices for data management [1]. It got a small amount of traction and
> cited the recent work by a few on this list on good-enough practices [2].
> [2] has a section called "Version Control for Data?", which lists some
> guiding principles, but doesn't have concrete prescriptions. I'm hoping to
> hear from others how they choose to manage workflows for larger datasets,
> both raw and refined.
>
> To motivate the discussion, a short recap of [1]. Take for example a climate
> model. The model takes as input highly detailed spatial data and
> preprocesses it by aggregating to a coarser resolution (say this is binary
> data like a raster). Further, it takes as input different assumptions about
> the future, e.g., how GHG emissions change as a function of time and region
> of the world (say this is tabular data, like a CSV or SQL database). The
> model will be run for a variety of "scenarios" (assumptions about the
> future) and the results will be used by other modelers downstream. Finally,
> the internals of the model or the input data may be updated with some
> frequency and results regenerated.
>
> Are there established best practices for how to manage and version this kind
> of workflow? Some mix of version control + DOIs seems like the best choice I
> can think of, but this inherently requires some amount of "rolling your own"
> -- not that that's a bad thing. Does anyone have any thoughts for how this
> kind of workflow scales in the input data, pre/post processing, and number
> of scenario dimensions?
>
> I'm looking forward to any feedback.
>
> Cheers,
> Matt
>
> [1]
> http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
> [2]
> http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/
>
