On Thu, Mar 03, 2016 at 01:38:43PM -0700, Davide Del Vento wrote:
> I know this is suboptimal, but I think that's the best you can do at
> the moment (and that assumes that at least one dataset would fit in
> your disk, which for climate datasets could be a generous
> assumption).

Depending on how you organize/access your data, IPFS [1] might be a
good solution for distributing your data over multiple machines while
still being able to easily access the subset you need from a single
host. For example, if your huge data is set up like

  .
  |-- 2014
  |   `-- …
  |-- 2015
  |   `-- …
  `-- 2016
      `-- …

IPFS would be good if you only needed one year at a time on the local
disk. It wouldn't be good if you needed January data across a range
of years, unless someone had also set up an index by month:

  .
  |-- 01
  |   `-- …
  |-- 02
  |   `-- …
  |-- 03
  …   `-- …

The data is content-addressable, so 2014/01/some-data (via the first
indexing scheme) and 01/2014/some-data (via the second indexing
scheme) would both use the same local object for the ‘some-data’ leaf.
And while there are plans to build Git-like version control onto IPFS,
I don't think anyone has gotten around to that yet. With the current
version, you get immutable Merkle hashes that uniquely identify your
data [2], but you don't have commit objects linking those snapshots
together.
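
To make the content-addressing point concrete, here is a toy Python
sketch. It is not IPFS's actual object format or API: the in-memory
`store`, the JSON directory encoding, the plain hex SHA-256 keys, and
the helper names (`put`, `put_tree`, `resolve`) are all assumptions
made up for illustration. It only shows that two index trees can
point at the same leaf object while having different root hashes.

```python
import hashlib
import json

# Toy stand-in for a content-addressed object store: hash -> bytes.
store = {}


def put(data: bytes) -> str:
    """Store bytes under the hex SHA-256 of their content."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def put_tree(entries: dict) -> str:
    """Store a directory as a sorted name -> child-hash mapping."""
    return put(json.dumps(entries, sort_keys=True).encode())


def resolve(root: str, path: str) -> str:
    """Walk a slash-separated path from a root hash to a child hash."""
    node = root
    for name in path.split("/"):
        node = json.loads(store[node])[name]
    return node


# One leaf object, stored once.
leaf = put(b"some-data")

# First indexing scheme: year/month/file.
by_year = put_tree({"2014": put_tree({"01": put_tree({"some-data": leaf})})})

# Second indexing scheme: month/year/file.
by_month = put_tree({"01": put_tree({"2014": put_tree({"some-data": leaf})})})

# Both paths end at the same leaf hash, so the data is stored only once.
same_leaf = (
    resolve(by_year, "2014/01/some-data")
    == resolve(by_month, "01/2014/some-data")
    == leaf
)
```

Each root hash pins down one immutable snapshot of a tree, but nothing
in the store records that one root came before or after another. That
is the missing piece that commit objects would supply.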
Anyhow, IPFS is still pretty new and fluxy, so I wouldn't trust it as
the sole location of important data, but folks who are bumping up
against data management issues might want to give it a spin.
Cheers,
Trevor
[1]: https://ipfs.io/
[2]: https://en.wikipedia.org/wiki/Merkle_tree
--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
