This takes the discussion in a somewhat tangential direction, but consider this:
We use hierarchical file systems, which are also a pain. Say I'm working on project PETSc and I'm writing a DOE proposal for it. Should I put it in ~/PETSc/Proposals/DOE/proposal or ~/Proposals/DOE/PETSc/proposal or ~/Proposals/PETSc/DOE? Later (3 months from now) I might want to come back and retrieve a file from that proposal tree. Where do I look for it? Maybe I should have all of these paths, all but one being soft links to the master path? I've tried that. It's a pain.

Basically, any hierarchical storage format, such as a file system, will impose a tree structure on what is fundamentally a (hyper)graph. GMail solves a similar problem by allowing multiple labels on a piece of email. Then I can search on any or several of the labels: Proposals, DOE, PETSc, irrespective of the order. A file system imposes an artificial order. You can think of labels as being the hyperedges in the hypergraph.

It would be nice to have a file system that functioned a bit like GMail, I think. In fact, I've thought about writing a Python replacement for 'ls' that would list files with a given label or labels. I'm too lazy and incompetent, however. In the simplest case the metadata could go right into the filename, but maybe that's not a good thing to do in general.

Dmitry.

On Tue, Feb 23, 2010 at 10:24 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>  I've thought about this but never done anything, I think it is worth
> investigating.
>
>  BTW: My long term goal is also that all PETSc source code lives in an
> appropriate database with appropriate relationships and meta-data stored
> there.
>
>  The fact that we (meaning HPC and OpenSource in general) use flat files so
> much shows a failure of something.
>
>   Barry
>
> On Feb 23, 2010, at 9:31 AM, Jed Brown wrote:
>
>> Matt and I talked about this a couple months ago, but I'd like to also
>> mention it here.
>> It seems to me that data formats like HDF5 are really
>> a pain to use for generic purposes, because you end up trying to map a
>> directed graph of object relations (composition) into a hierarchical
>> data format, and then implement relational queries on top of this
>> hierarchy.  (I've done this, to some extent, and I ended up writing
>> cumbersome code to walk this hierarchy to answer queries that would be
>> one-line SQL queries.)
>>
>> To elaborate slightly on the problem, the goal would be to write vectors
>> living on a DMComposite, with extra semantics like time step and units,
>> in a way that could be used for visualization as well as checkpoints for
>> forward and adjoint models.  PETSc's unadorned binary IO is fine if the
>> same code is going to read it back in, because everything will be wired
>> up correctly and we're just loading into a Vec (although it's already
>> somewhat tricky when the layout changes in the unstructured case).  But
>> there just isn't enough metadata to operate on in any sort of generic
>> way, and I hate writing custom code to describe meshes and relations
>> between them.
>>
>> Current scientific data formats (at least those I have seen) are a
>> hassle to use since they have poor support for expressing relations.
>> HDF5 has the equivalent of file-system symlinks, but after
>> normalization, all the relations end up being encoded as a bunch of
>> symlinks, which is a relatively low-level view and isn't a particularly
>> convenient thing to traverse when answering a query.
>>
>> So I'm curious if anyone has put such metadata into a relational
>> database instead of trying to contort it into one of these "scientific"
>> data formats.  My thought would be to drop only the metadata into
>> something like Sqlite, and write the arrays themselves using MPI-IO (or
>> HDF5/NetCDF/whatever, but these don't provide much when we aren't using
>> them for metadata).
>> This would allow efficient support of queries like
>> "all vector fields at step M" and "fields B and C from step M to N on
>> subdomains intersecting bounding box XYZ".  This isn't completely
>> different from what XDMF tries to do, but experimentation with that left
>> a sour taste.  Is SQL a stupid idea for this purpose and I'd be better
>> off writing code to support the queries I want on HDF5/XDMF/something
>> else?
>>
>> Jed
>
>
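
For what it's worth, the two example queries in Jed's message can be sketched with Python's standard sqlite3 module. This is only a rough illustration under assumptions that are not in the thread: the table names (subdomain, field_block), their columns, and the helper functions are all hypothetical, and the arrays themselves are assumed to be written elsewhere (MPI-IO or similar), with only a path and byte offset recorded here.

```python
import sqlite3


def make_catalog():
    """Create an in-memory metadata catalog (hypothetical schema, for illustration)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE subdomain (
            id INTEGER PRIMARY KEY,
            xmin REAL, xmax REAL, ymin REAL, ymax REAL, zmin REAL, zmax REAL
        );
        CREATE TABLE field_block (
            field TEXT NOT NULL,          -- field name, e.g. 'B'
            step INTEGER NOT NULL,        -- time step index
            units TEXT,
            subdomain_id INTEGER NOT NULL REFERENCES subdomain(id),
            path TEXT NOT NULL,           -- file holding the raw array
            byte_offset INTEGER NOT NULL  -- where this block starts in that file
        );
    """)
    return db


def fields_at_step(db, step):
    """'All vector fields at step M': names of fields stored at one step."""
    rows = db.execute(
        "SELECT DISTINCT field FROM field_block WHERE step = ? ORDER BY field",
        (step,),
    )
    return [r[0] for r in rows]


def blocks_in_box(db, fields, first_step, last_step, box):
    """Blocks of the given fields, for steps first_step..last_step (inclusive),
    on subdomains whose bounding box intersects box.

    box is (xmin, xmax, ymin, ymax, zmin, zmax).
    """
    xmin, xmax, ymin, ymax, zmin, zmax = box
    marks = ",".join("?" * len(fields))  # one placeholder per field name
    sql = f"""
        SELECT b.field, b.step, b.subdomain_id, b.path, b.byte_offset
        FROM field_block AS b
        JOIN subdomain AS s ON s.id = b.subdomain_id
        WHERE b.field IN ({marks})
          AND b.step BETWEEN ? AND ?
          AND s.xmin <= ? AND s.xmax >= ?
          AND s.ymin <= ? AND s.ymax >= ?
          AND s.zmin <= ? AND s.zmax >= ?
        ORDER BY b.step, b.field, b.subdomain_id
    """
    # Two boxes overlap when, on each axis, each one starts before the other ends.
    params = [*fields, first_step, last_step,
              xmax, xmin, ymax, ymin, zmax, zmin]
    return db.execute(sql, params).fetchall()
```

Each returned row says where to find one block of raw data, so a reader would then seek to byte_offset in path and load the array with whatever IO layer wrote it. Whether this scales to a parallel run with many writers is a separate question that the sketch does not address.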

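The label-based 'ls' that Dmitry describes could look roughly like the following. This is a minimal sketch, not an existing tool: it assumes the labels are kept in a sidecar JSON index rather than in the filenames, and the function names (load_index, save_index, tag, ls) are made up for illustration.

```python
import json
from pathlib import Path


def load_index(index_file):
    """Read a {path: labels} index from JSON; a missing file means an empty index."""
    p = Path(index_file)
    if not p.exists():
        return {}
    return {path: set(labels) for path, labels in json.loads(p.read_text()).items()}


def save_index(index, index_file):
    """Write the index as JSON, storing each label set as a sorted list."""
    serializable = {path: sorted(labels) for path, labels in index.items()}
    Path(index_file).write_text(json.dumps(serializable, indent=2))


def tag(index, path, *labels):
    """Attach one or more labels to a path; existing labels are kept."""
    index.setdefault(str(path), set()).update(labels)


def ls(index, *labels):
    """List paths carrying all of the given labels, in any order.

    With no labels, every indexed path is listed.
    """
    wanted = set(labels)
    return sorted(path for path, have in index.items() if wanted <= have)
```

For example, after tag(index, "proposal.tex", "Proposals", "DOE", "PETSc"), both ls(index, "DOE", "PETSc") and ls(index, "PETSc", "DOE") list that file, which is the order independence a directory tree cannot give. The obvious weakness is that the index goes stale when files are moved or renamed outside the tool.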