Re: Persistent meta-data store for scientific data

Aidan Heerdegen Sat, 08 Dec 2018 02:15:37 -0800

Hi Kevin,

> This is an interesting question. I'm in computing for high energy physics, 
> where there is custom software used for tracking the data files, their 
> locations, etc. I hadn't considered the application of Perkeep in this area 
> (I've only toyed with it as a personal project). For tracking where the data 
> is the HEP community has been moving to Rucio https://rucio.cern.ch/. The 
> metadata, however, is still a bit of a wildcard: Rucio doesn't currently have a 
> native metadata store. I'm interested in seeing if Perkeep can be offered as 
> a solution here, but if not I may be able to provide an alternative.


Thanks for the pointer to Rucio. It looks interesting, with some great features.

Our issue is that we really don't have anything like Rucio on offer (I will 
ask, but nothing ever happens quickly, and I'm not holding my breath).

So I'd like to be able to have at least some way of knowing where a file has 
been, and what relationship it has to ancestors/siblings/parents.

Nothing like a concrete example:

I support ocean modellers (and some coupled climate modelling). The models have 
many input files. For example, we have an input file that specifies sea surface 
salinity, which is derived from observations. At some point an error may be 
found in that file, or it may simply be updated. So we 
start using the updated file, and we have ways of tracking the hash of inputs 
to our experiment, so we can uniquely identify the file we used, and the last 
location it was accessed. But in comparing experiments before and after, with 
different sea surface salinity files, we don't know the relationship between 
them. Clearly we should have noted this in the metadata of the updated file, 
but we all know any system that relies on people to do the right thing will 
inevitably break down.
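
A minimal sketch of recording that missing relationship, keyed on the file hashes 
we already track. The table name `supersedes`, its columns, and the helper 
functions are all hypothetical, and SHA-256 is only an assumed choice of hash:

```python
import hashlib
import sqlite3


def file_hash(data: bytes) -> str:
    """Hash file contents (SHA-256 is an assumption, any stable hash would do)."""
    return hashlib.sha256(data).hexdigest()


def open_store(path=":memory:"):
    """Open (or create) a store holding 'new file supersedes old file' links."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS supersedes ("
        " new_hash TEXT NOT NULL,"
        " old_hash TEXT NOT NULL,"
        " reason TEXT,"
        " PRIMARY KEY (new_hash, old_hash))"
    )
    return conn


def record_update(conn, new_hash, old_hash, reason=""):
    """Note that the file with new_hash replaces the file with old_hash."""
    conn.execute(
        "INSERT OR IGNORE INTO supersedes VALUES (?, ?, ?)",
        (new_hash, old_hash, reason),
    )
    conn.commit()


def ancestors(conn, start_hash):
    """Return every earlier version reachable from start_hash, nearest first."""
    found, queue, seen = [], [start_hash], {start_hash}
    while queue:
        current = queue.pop(0)
        rows = conn.execute(
            "SELECT old_hash FROM supersedes WHERE new_hash = ? ORDER BY old_hash",
            (current,),
        ).fetchall()
        for (old,) in rows:
            if old not in seen:  # guard against accidental cycles
                seen.add(old)
                found.append(old)
                queue.append(old)
    return found
```

With links like these recorded at the point a file is replaced, comparing two 
experiments reduces to asking whether one input hash appears among the ancestors 
of the other.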

Ideally if we needed to retrieve the original input file we could find it in 
the location from which it was last used, but if it isn't there, where has it 
gone? Now I'm in the process of writing a little something to dump some of this 
information to a SQLite database when we archive files to tape storage, so this 
could be queried for matches to the hash of interest. But this is a number of 
disparate systems all with a little bit of metadata and some way of tying them 
all together. It seemed like Perkeep might be a good way of coalescing most of 
it into a single, flexible system.
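
For illustration, the archive-time dump described above could look roughly like 
the sketch below. It is not the actual tool: the `archive_log` table, its 
columns, the function names, and the use of SHA-256 are all assumptions.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def hash_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_log(db_path=":memory:"):
    """Open (or create) the hypothetical archive log database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive_log ("
        " file_hash TEXT NOT NULL,"
        " source_path TEXT NOT NULL,"
        " tape_location TEXT NOT NULL,"
        " archived_at TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_archive_hash ON archive_log (file_hash)"
    )
    return conn


def log_archive(conn, source_path, tape_location):
    """Record where a file came from and where it went; return its hash."""
    digest = hash_file(source_path)
    conn.execute(
        "INSERT INTO archive_log VALUES (?, ?, ?, ?)",
        (
            digest,
            str(source_path),
            tape_location,
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return digest


def find_by_hash(conn, file_hash):
    """List every recorded location for a hash of interest, oldest first."""
    return conn.execute(
        "SELECT source_path, tape_location, archived_at FROM archive_log"
        " WHERE file_hash = ? ORDER BY archived_at",
        (file_hash,),
    ).fetchall()
```

Calling `log_archive` from the archiving step and `find_by_hash` with the hash 
stored alongside an experiment would answer "where has it gone?" for any file 
that passed through the archive, though only for that one system.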

Cheers

Aidan

