Re: Persistent meta-data store for scientific data

Aidan Heerdegen Mon, 10 Dec 2018 02:10:05 -0800

Hi Jon,

Thanks, and yes, I did find that a while ago. It does look like a very 
interesting project. I've had another look and maybe we could use it, but I'm 
not sure it is quite what I am after. dat places an emphasis on sharing and 
versioning data files. I want to promiscuously capture all the links between my 
files.

For example, say I have an ocean model with some input files:

input_a
input_b
input_c

I run my model, which uses file hashes to uniquely identify the inputs used, 
along with the hash of the executable used to run the model.
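
As a rough illustration of that hashing step, here is a minimal sketch in 
Python. The function names (file_digest, run_manifest) and the choice of 
SHA-256 are my own assumptions for illustration; they are not part of perkeep 
or of the model.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so very large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def run_manifest(executable, inputs):
    """Hypothetical manifest: digest of the executable and of each input."""
    return {
        "executable": file_digest(executable),
        "inputs": {Path(p).name: file_digest(p) for p in inputs},
    }
```

Something like run_manifest("model.exe", ["input_a", "input_b", "input_c"]) 
would then give one record identifying exactly what went into a run.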

I can now tag my output files:

output_tile_1
output_tile_2
..
output_tile_200

with the git hash for my model run. These are tiled files, so I will need to 
run another post processing program to stitch them all back together.

Now I can (and should) inherit the git hash that is shared by the tiled files 
and place it in the output_tiled file. 

So with some careful combing of meta-data I can recreate the relationships 
between these files, but I think something like perkeep could be so much more 
straightforward.

When the output_tiled is created I could add the relationship to all the tiled 
files. When I delete the tiled files I can keep the metadata which links them 
back to the executable that created them (directly in its metadata) and the 
input files that were used in the experiment.
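
To make the idea concrete, here is a small in-memory sketch of the kind of 
link-keeping I have in mind. It is not perkeep's API; Record, ProvenanceStore 
and the short keys ("exe", "t1", ...) are hypothetical stand-ins, and in 
practice the keys would be content hashes.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    derived_from: list = field(default_factory=list)  # keys of parent records
    deleted: bool = False  # the data file is gone, the metadata is kept


class ProvenanceStore:
    def __init__(self):
        self.records = {}

    def add(self, key, name, derived_from=()):
        self.records[key] = Record(name, list(derived_from))

    def mark_deleted(self, key):
        # Never drop the record itself, only note that the file is gone.
        self.records[key].deleted = True

    def ancestors(self, key):
        """Return the keys of everything this record was derived from."""
        seen = set()
        stack = list(self.records[key].derived_from)
        while stack:
            k = stack.pop()
            if k not in seen:
                seen.add(k)
                stack.extend(self.records[k].derived_from)
        return seen


store = ProvenanceStore()
store.add("exe", "model_executable")
store.add("in_a", "input_a")
store.add("t1", "output_tile_1", ["exe", "in_a"])
store.add("t2", "output_tile_2", ["exe", "in_a"])
store.add("out", "output_tiled", ["t1", "t2"])

# The tiles are deleted from disk, but the chain back to the inputs survives.
store.mark_deleted("t1")
store.mark_deleted("t2")
lineage = store.ancestors("out")
```

Here lineage still contains the executable and input_a, even though the 
intermediate tiles are flagged as deleted.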

The features of perkeep that I like, and think are valuable in this context, 
are never throwing anything away and storing everything. Failed or old 
experiments that have been deleted are no longer useful, but there is use in 
knowing they existed and in confirming that there is no point scouring the disk 
for them: they are gone. I like the promiscuity because sometimes you don't 
know what you want to keep, and rather than torture yourself about that, just 
keep it all!

I also like the promise of being able to share/coalesce perkeep data stores. 
We're not sysadmins, so we need a system in which isolated, individual users 
can populate their own databases, make them readable/shareable, and coalesce 
the data so we can search all the experiments.
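
A sketch of what I mean by coalescing, assuming each user's store is just a 
mapping from content hash to a metadata record. This is my own illustration 
and says nothing about how perkeep would actually do it; the record layout 
("locations", "derived_from") and the function name are hypothetical.

```python
def coalesce(*stores):
    """Merge per-user stores keyed by content hash into one searchable view."""
    merged = {}
    for store in stores:
        for key, rec in store.items():
            target = merged.setdefault(
                key, {"locations": set(), "derived_from": set()}
            )
            # Same hash means same file, so the records are simply unioned.
            target["locations"] |= set(rec.get("locations", ()))
            target["derived_from"] |= set(rec.get("derived_from", ()))
    return merged
```

Because records are keyed by hash, two users who hold the same file end up 
with a single merged entry listing both locations.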

Cheers

Aidan

> On 9 Dec 2018, at 5:01 am, Jon Van Oast <[email protected]> wrote:
> 
> hello, aidan.   have you heard of the dat project ?   it seems like it might 
> address your needs.  i have not used it much other than dabbling with it a 
> small amount, so cant speak to pros and cons.   but it has been on my radar 
> for scientific data work i do....
> 
> best,
> -jon
> 
> On Tuesday, December 4, 2018 at 3:01:11 PM UTC-8, Aidan Heerdegen wrote:
> Hi,
> 
> I have a use case where I specifically do not want to store my data files.
> 
> They are scientific datasets, usually model output, which might total 
> hundreds of TB or more. 
> 
> What I would like to do is store the meta-data from the datasets, their 
> location(s) and some transactional information, if they are moved, deleted 
> etc.
> 
> It is essential that the original data can be deleted, in some cases it is no 
> longer required, and in others it might be backed up to a slow to access tape 
> based data silo.
> 
> Each user would have their own perkeep store, but the ability to 
> coalesce/share the information in those stores would be almost essential.
> 
> Does this sound like a use case for perkeep? I like the idea of keeping ALL 
> my metadata (it is pretty small), using it to find files with particular 
> characteristics, retrieve them from a backup storage location, that sort of 
> thing.
> 
> If perkeep is a good match, can anyone suggest what "mappings" I might need 
> to think about in perkeep terms? e.g.
> 
> A permanode would be required for each unique file instance? I would be 
> storing an identifying hash of some sort to ensure it was unique. 
> 
> If a file is modified such that the hash changes, I would need a new 
> permanode, but would like to keep a relationship between the two files.
> 
> Most data files are netCDF, so I would like to dump the metadata as a JSON 
> blob and associate that data with a permanode. I guess perkeep has existing 
> methods for dealing with JSON? And indexing/searching it?
> 
> Thanks very much for any help,
> 
> Cheers
> 
> Aidan
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Perkeep" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.