David Nickerson <[EMAIL PROTECTED]> writes:

> On Sat, Nov 15, 2008 at 3:55 PM, Jon Olav Vik <[EMAIL PROTECTED]> wrote:
[snip]
> > Yes, but there needs to be *some* way of identifying the information in the
> > HDF5 file, like "using parameter values as indexes". A purist solution might be
> > to have each simulation result annotated with the URI for that particular
> > parameter set and model. However, any analysis would then require running back
> > and forth between the CellML model (DOM API, metadata, ...) and the huge output
> > files (e.g. HDF5). Until the CellML tools (DOM, code generation, ...) fit
> > seamlessly into more mainstream tools, I'd prefer not to lug around the CellML
> > DOM API everywhere I take my data. (No offense.
>
> but doesn't this get back to the issue of needing all the information
> like units in the HDF5 data file also?

First of all, thanks for a very constructive answer.

> If you'd prefer to have an HDF5
> data file that can be unambiguously interpreted without reference to
> the source CellML models and/or simulations,

No, that's not what I meant. The *required* information in the HDF5 file would
be something like:
a) unambiguous identification of model
b) unambiguous specification of parameter values
c) unambiguous identification of output variables

For a), the URI to the model should be an ideal "canonical" reference (but I'd
still appreciate a human-readable, (autogenerated) text-formatted reference as
annotation). For b), the URI to a CellML simulation spec might be an ideal
canonical reference. For c), I guess the HDF5 array names should simply be the
variable names from the CellML model.

> then that is a whole lot
> more data that would need to be in the data file....

I do intend for the HDF5 file to be interpreted in conjunction with the CellML
model. However, things like a human-readable citation in addition to the URI
would be *convenient*, as would a copy of the parameter values used in the
simulation.

Regarding redundancy, it would be understood that the *official* parameter
values were those in the CellML simulation specification. Consistency could be
verified automatically (e.g. by a CRC checksum) whenever desired. It would be
an error to change those parameter values after that part of the HDF5 structure
was begun.

Regarding storage space, the 180 kB required for e.g. the Bondarenko model are
negligible compared to the megabytes of output for even a modest exploration of
parameter combinations.

> if you want to interpret the data in the file without needing to go
> back and forth with the CellML models then I'd guess you probably want
> to add some tool-specific data to the HDF5 group that gets generated
> by the proposed tool/service...or not.

Yes, that's about what I had in mind.
Similar in some ways to the various tabs on the repository webpages of
cellml.org, or the autogenerated code in different programming languages:
Certainly redundant, but convenient.

> Maybe the below has convinced
> me that this could be done in a nice way...
>
> > I was thinking of this extra annotation as "write once, read many", just
> > labelling the boxes. There exist external tools for exploring HDF5 files,
> > http://www.hdfgroup.org/hdf-java-html/hdfview/
> > and these will be a lot less useful if the data structure doesn't indicate
> > which parameters a result is for. (That said, it might be useful to verify the
> > integrity of the link between model, parameters and output e.g. by some kind of
> > hashing.)
>
> This sounds more like you are after a complete translation of the
> source models and simulations into HDF5.

I don't know about "into" HDF5; that's just a vehicle for storing numbers, has
no concept of mathematical functions, etc. (...and I know that you know this
better than I do 8-) I'm just after human-readable annotation to aid in
navigation and exploration of the data. (Well, maybe some machine-readable
annotation too; units would be nice to give meaning to the numbers.)

> For a given model you'd have
> a list of all the "unique" variables in the model annotated with a
> string containing the full expansion of the variable's units into the
> set of base units, and the variable's value field - which would be a
> scalar for constant parameters and an array for dynamic variables. I
> guess you'd also want some kind of reference to the index field (i.e.,
> time). Not sure if you'd also want to keep track of all the actual
> variables in the model that are used for each of the unique variables
> in the simulation instantiation, but that could be done.
>
> In such a tool you'd still lose a lot of the annotation in the source
> CellML models.
> But I guess if you simply want an optimised data store
> the above should give you everything you need and if required in
> special cases you can also link back to the CellML models as there
> should still be some URI's stored somewhere in the HDF5 data file. Of
> course, if you want to do all this nice and quickly you'd likely
> ignore the units anyway if you know that all your simulations are in
> compatible or identical units so they can be left back in the CellML
> model and can be looked up if needed.

To me this sounds very promising. I'd be interested to hear what others think.

> One consideration with such a solution is that I have found the HDF5
> packet table interface to be about the most efficient way to stream
> simulation data to a persistent store. I have one packet table per
> simulation and use the model variable URI's to set up a mapping into
> that packet table for each dynamic variable. So rather than using the
> variable field of dynamic variables for an array, it is probably more
> efficient to set it up as an index or something into the packet
> table....sounds like it should be workable :)

I must admit I do not yet know what a "packet table" is. I think it might be
very helpful if you could write up a toy example of how you currently use HDF5
with CellML models.

[As for myself and my short-term hacks, I'm currently leaning towards NumPy
ndarrays or recarrays (allowing reference to array "columns" by name) stored in
HDF5 via PyTables.
http://thread.gmane.org/gmane.comp.python.numeric.general/22250/ ]

Best regards,
Jon Olav
_______________________________________________
cellml-discussion mailing list
[email protected]
http://www.cellml.org/mailman/listinfo/cellml-discussion
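
For illustration only, the consistency check mentioned in the message (a CRC
checksum tying a convenience copy of the parameter values to the official ones)
might be sketched as below. This is a minimal sketch under stated assumptions,
not code from any CellML tool: the function names, the parameter names and
values, and the choice of JSON as the canonical text form are all hypothetical,
and only the Python standard library is used.

```python
import json
import zlib


def parameter_checksum(params):
    """Return a CRC-32 of a canonical text form of the parameter values."""
    # Sort keys and fix separators so equal values always serialise identically.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return zlib.crc32(canonical.encode("utf-8")) & 0xFFFFFFFF


def is_consistent(params_copy, stored_checksum):
    """True if the convenience copy still matches the stored checksum."""
    return parameter_checksum(params_copy) == stored_checksum


# Hypothetical parameter names and values, for illustration only.
official = {"g_Na": 13.0, "g_K": 0.5, "stim_amplitude": -80.0}
stored = parameter_checksum(official)  # would be kept alongside the results

copy_in_hdf5 = dict(official)
print(is_consistent(copy_in_hdf5, stored))  # True

copy_in_hdf5["g_K"] = 0.6  # an edit made after the results were written
print(is_consistent(copy_in_hdf5, stored))  # False
```

The checksum here covers the text form of the values, so it depends on how
floating-point numbers are printed; a real tool would need to agree on that
representation (or checksum the simulation specification itself).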
