On Feb 23, 2010, at 1:44 PM, Dmitry Karpeev wrote:

> Yes, but what about using Spotlight programmatically (e.g., from
> PETSc) to store rich state, checkpointing, etc?
> For example, I want to store a Vec. How do I label it? There may be
> various user contexts that share it, so I'd like to label it with
> all of them.
>
> In a way, I don't want to have to look at my home directory (or any
> directory) at all.
> I just want to extract files based on a given (set of) label(s).
>

   Yes

> Dmitry.
>
> On Tue, Feb 23, 2010 at 1:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> With Google (and Spotlight on the Mac) is there any need to organize
>> anything anymore? Just burp down the data any way you please anywhere
>> you want it and then have smart search tools find it for you and
>> format it the way you need it at the time you need it? This does mean
>> you need decent tools to parse random stuff for the search to
>> understand it.
>>
>> Ironically, in the past few years with Spotlight on my Mac I actually
>> do a better job of organizing my home directory structure than I ever
>> have before.
>>
>> Barry
>>
>> On Feb 23, 2010, at 1:31 PM, Dmitry Karpeev wrote:
>>
>>> This takes the discussion in a somewhat tangential direction, but
>>> consider this:
>>>
>>> We use hierarchical file systems, which are also a pain.
>>> Say, I'm working on project PETSc and I'm writing a DOE proposal
>>> for it.
>>> Should I put it in ~/PETSc/Proposals/DOE/proposal or
>>> ~/Proposals/DOE/PETSc/proposal or
>>> ~/Proposals/PETSc/DOE?
>>> Later (3 months from now) I might want to come back and retrieve a
>>> file from that proposal tree.
>>> Where do I look for it?
>>> Maybe I should have all of these paths, all but one being soft links
>>> to the master path?
>>> I've tried that. It's a pain.
>>>
>>> Basically, any hierarchical storage format, such as a file system,
>>> will impose a tree structure on what is fundamentally a (hyper)graph.
>>> GMail solves a similar problem by allowing multiple labels on a
>>> piece of email.
>>> Then I can search on any or several of the labels: Proposals, DOE,
>>> PETSc, irrespective of the order.
>>> A file system imposes an artificial order.
>>> You can think of labels as being the hyperedges in the hypergraph.
>>>
>>> It would be nice to have a file system that functioned a bit like
>>> GMail, I think.
>>> In fact, I've thought about writing a Python replacement for 'ls'
>>> that would list files with a given label or labels. I'm too lazy
>>> and incompetent, however.
>>> In the simplest case the metadata could go right into the filename,
>>> but maybe that's not a good thing to do in general.
>>>
>>>
>>> Dmitry.
>>>
>>> On Tue, Feb 23, 2010 at 10:24 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> I've thought about this but never done anything. I think it is
>>>> worth investigating.
>>>>
>>>> BTW: My long-term goal is also that all PETSc source code lives in
>>>> an appropriate database with appropriate relationships and
>>>> meta-data stored there.
>>>>
>>>> The fact that we (meaning HPC and OpenSource in general) use flat
>>>> files so much shows a failure of something.
>>>>
>>>> Barry
>>>>
>>>> On Feb 23, 2010, at 9:31 AM, Jed Brown wrote:
>>>>
>>>>> Matt and I talked about this a couple months ago, but I'd like to
>>>>> also mention it here. It seems to me that data formats like HDF5
>>>>> are really a pain to use for generic purposes, because you end up
>>>>> trying to map a directed graph of object relations (composition)
>>>>> into a hierarchical data format, and then implement relational
>>>>> queries on top of this hierarchy. (I've done this, to some extent,
>>>>> and I ended up writing cumbersome code to walk this hierarchy to
>>>>> answer queries that would be one-line SQL queries.)
>>>>>
>>>>> To elaborate slightly on the problem, the goal would be to write
>>>>> vectors living on a DMComposite, with extra semantics like time
>>>>> step and units, in a way that could be used for visualization as
>>>>> well as checkpoints for forward and adjoint models.
>>>>> PETSc's unadorned binary IO is fine if the same code is going to
>>>>> read it back in, because everything will be wired up correctly
>>>>> and we're just loading into a Vec (although it's already somewhat
>>>>> tricky when the layout changes in the unstructured case). But
>>>>> there just isn't enough metadata to operate on in any sort of
>>>>> generic way, and I hate writing custom code to describe meshes
>>>>> and relations between them.
>>>>>
>>>>> Current scientific data formats (at least those I have seen) are
>>>>> a hassle to use since they have poor support for expressing
>>>>> relations. HDF5 has the equivalent of file-system symlinks, but
>>>>> after normalization, all the relations end up being encoded as a
>>>>> bunch of symlinks, which is a relatively low-level view and isn't
>>>>> a particularly convenient thing to traverse when answering a
>>>>> query.
>>>>>
>>>>> So I'm curious if anyone has put such metadata into a relational
>>>>> database instead of trying to contort it into one of these
>>>>> "scientific" data formats. My thought would be to drop only the
>>>>> metadata into something like SQLite, and write the arrays
>>>>> themselves using MPI-IO (or HDF5/NetCDF/whatever, but these don't
>>>>> provide much when we aren't using them for metadata). This would
>>>>> allow efficient support of queries like "all vector fields at
>>>>> step M" and "fields B and C from step M to N on subdomains
>>>>> intersecting bounding box XYZ". This isn't completely different
>>>>> from what XDMF tries to do, but experimentation with that left a
>>>>> sour taste. Is SQL a stupid idea for this purpose, and would I be
>>>>> better off writing code to support the queries I want on
>>>>> HDF5/XDMF/something else?
>>>>>
>>>>> Jed
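
A minimal sketch of the "metadata in SQLite, arrays elsewhere" idea from
Jed's message, using only Python's standard sqlite3 module. Nothing here
is PETSc API: the table and column names (field, subdomain, block,
byte_offset, n_values), the file names, the units, and the helper
functions are all hypothetical, chosen only to show that the two example
queries become short SQL statements. Each block row records where a bulk
array would live (file path, byte offset, length); the arrays themselves
are assumed to be written separately, for instance with MPI-IO, and are
not touched by this code.

```python
import sqlite3

# Hypothetical schema: metadata only. Bulk arrays live in separate files.
SCHEMA = """
CREATE TABLE field (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    units TEXT
);
CREATE TABLE subdomain (
    id   INTEGER PRIMARY KEY,
    xmin REAL, xmax REAL,
    ymin REAL, ymax REAL,
    zmin REAL, zmax REAL
);
CREATE TABLE block (
    field_id     INTEGER REFERENCES field(id),
    subdomain_id INTEGER REFERENCES subdomain(id),
    step         INTEGER,
    path         TEXT,     -- file that holds the raw array
    byte_offset  INTEGER,  -- where this block starts in that file
    n_values     INTEGER   -- number of values in the block
);
"""


def open_catalog(path=":memory:"):
    """Create an empty metadata catalog."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def fields_at_step(db, step):
    """Query 1: names of all fields that have data at a given step."""
    rows = db.execute(
        "SELECT DISTINCT f.name FROM block b "
        "JOIN field f ON f.id = b.field_id "
        "WHERE b.step = ? ORDER BY f.name",
        (step,),
    )
    return [r[0] for r in rows]


def blocks_in_box(db, names, first, last, box):
    """Query 2: blocks of the named fields for steps first..last on
    subdomains whose bounding box intersects box."""
    xmin, xmax, ymin, ymax, zmin, zmax = box
    marks = ",".join("?" * len(names))
    sql = f"""
        SELECT f.name, b.step, b.subdomain_id,
               b.path, b.byte_offset, b.n_values
        FROM block b
        JOIN field f     ON f.id = b.field_id
        JOIN subdomain s ON s.id = b.subdomain_id
        WHERE f.name IN ({marks})
          AND b.step BETWEEN ? AND ?
          AND s.xmin <= ? AND s.xmax >= ?
          AND s.ymin <= ? AND s.ymax >= ?
          AND s.zmin <= ? AND s.zmax >= ?
        ORDER BY f.name, b.step, b.subdomain_id"""
    # Two boxes intersect when, on every axis, each one starts before
    # the other one ends.
    args = [*names, first, last, xmax, xmin, ymax, ymin, zmax, zmin]
    return db.execute(sql, args).fetchall()


def demo_catalog():
    """Made-up catalog: fields A, B, C on two subdomains, steps 0..3."""
    db = open_catalog()
    db.executemany(
        "INSERT INTO field VALUES (?, ?, ?)",
        [(1, "A", "m/s"), (2, "B", "Pa"), (3, "C", "K")],
    )
    db.executemany(
        "INSERT INTO subdomain VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(1, 0, 1, 0, 1, 0, 1), (2, 1, 2, 0, 1, 0, 1)],
    )
    n = 100  # values per block, 8 bytes each
    for step in range(4):
        slot = 0
        for fid in (1, 2, 3):
            for sid in (1, 2):
                db.execute(
                    "INSERT INTO block VALUES (?, ?, ?, ?, ?, ?)",
                    (fid, sid, step, f"step{step:04d}.bin", 8 * n * slot, n),
                )
                slot += 1
    return db


if __name__ == "__main__":
    db = demo_catalog()
    print(fields_at_step(db, 2))
    for row in blocks_in_box(db, ["B", "C"], 1, 2,
                             (1.5, 1.8, 0.2, 0.4, 0.2, 0.4)):
        print(row)
```

A reader would then seek to byte_offset in path and read n_values
numbers to get the actual data. Labels in the sense Dmitry describes
could be handled the same way, with one more table that maps block rows
to any number of label strings.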
