On Thursday 31 January 2008 01:15:46 pm Francesc Altet wrote:
> On Thursday 31 January 2008, you wrote:
> > It looks like pytables already works well with existing hard links. I
> > wanted to ask though if there might be any caching issues lurking
> > beneath the surface that I should investigate. An example data set is
> > attached (created with the nexus library using Paul Kienzle's python
> > wrapper) where t.root.entry.r8_data is a 5x4 64bit float array and
> > t.root.link.renLinkData is a hard link to the same. I can modify the
> > array following either path, and support looks completely
> > transparent, but I just wanted to be sure.
>
> I've been talking with Ivan about this, and we have come to the
> conclusion that implementing links in PyTables is much more hairy than
> I anticipated. The problem is mainly with metadata coherency (there
> should not be problems with the data itself, as you have checked; maybe
> tables, which have I/O buffers, could have some, but only in rather
> exceptional cases).
>
> As you probably know, PyTables puts a lot of effort into caching
> metadata in order to accelerate access to it (this is why it is so
> efficient when handling potentially large hierarchies, see [1]_). The
> metadata that is cached is basically found at three points:
>
> * Node objects
> * AttributeSet objects
> * Indexes objects (this is a special case for the Pro version, but
>   important to us).
>
> Here are a couple of examples of the kind of problems that can be
> seen. Firstly, problems with the node cache:

[...]

I didn't think to check these, I see they are quite serious.

> These metadata cache coherency issues are pretty difficult to solve, as
> we need to rethink the whole structure of PyTables to include the new
> issues in the schema, because the current one was just not designed to
> deal with this.
>
> An additional problem is that it seems that HDF5 does allow hard
> links to Groups, which introduces the possibility of creating 'loops'
> in the hierarchy. Of course, supporting this in PyTables introduces more
> complexity, but, as a first approximation, we could 'disable' this
> feature, so I'll skip the discussion of the issues in that regard.
>
> Going back to the cache coherency problem, a key aspect for solving it
> would be how to uniquely determine the data area of a node on disk
> (i.e. the equivalent of an 'inode' in a filesystem), and take this
> identifier as the new 'primary key' for the node cache (right now, this
> role is played by the node path, but this is precisely what introduces
> the cache coherency problem). A possible candidate for playing
> this 'primary key' role would be the '_v_objectID' node attribute, but
> unfortunately HDF5 returns different IDs for links pointing to the
> same 'inode':
>
> In [77]: f.root.entry.sample._v_objectID
> Out[77]: 134217739
>
> In [78]: f.root.link.renLinkGroup._v_objectID
> Out[78]: 134217748
>
> Ummm, I will ask on the HDF5 mailing list if it would be possible to get
> a unique identifier for all the links pointing to the same data area. If
> HDF5 can provide such an identifier, the next step should be to rethink
> the structure of the metadata cache in PyTables and implement a new one
> based on the 'inode' concept, instead of the 'node path' one, which
> certainly is not a trivial task (to say the least).

I saw your post, hopefully there is some support in HDF5 on which to build.

> > It looks like rounding out support for hard links would simply
> > require adding a new method to File to create the link.
> > I propose
> > something like
> >
> > File.linkNode(self, where, name, curObject)
> > or
> > File.createLink(self, where, name, curObject)
> >
> > The argument list here follows the pattern in createTable.
>
> Yeah, I like both. Perhaps the 'createLink' flavor is more consistent
> with the 'actionNode' pattern that is used in other constructors.
>
> > Soft links would take more work. I don't think I would use them
> > myself, so I probably am the wrong person to suggest their
> > implementation. Maybe they would require a new pytables object
> > deriving from Leaf, I don't really know how such a thing should
> > behave. They could be added later, and be created with the same file
> > method through the addition of a linktype kwarg that defaults to hard
> > links.
>
> Curiously enough, Ivan and I think that 'soft' links would be far
> cheaper to implement in current PyTables than their 'hard'
> counterparts. This is due to the fact that metadata 'primary keys' in
> our object tree cache could continue to be based on 'node paths'
> instead of 'inodes'. Still, there is the problem of metadata cache
> coherency, but we could think about maintaining lists of 'soft links'
> pointing to each 'real' node, so that we can update them on each
> modification of the real node. Or perhaps, implementing the 'soft
> links' using a proxy pattern would be more than enough. These are
> ideas off the top of my head, but we should think more about that
> anyway.
>
> Well, sorry for not being able to anticipate so many difficulties
> before.

Not at all. I hope that once we hear back from the hdf5 list, there might be some way forward.

Darren
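
For illustration, the 'inode'-keyed node cache discussed in the quoted text could be sketched roughly as follows. This is a toy in plain Python, not PyTables code: the class InodeKeyedCache, its methods, and the integer identities are all hypothetical, and it assumes HDF5 could supply one identifier shared by every link to the same data area, which this thread has not yet established.

```python
class InodeKeyedCache:
    """Toy metadata cache keyed by a storage identity, not by node path.

    Hypothetical sketch: none of these names come from PyTables or HDF5.
    """

    def __init__(self):
        self._path_to_inode = {}  # several paths may map to one identity
        self._metadata = {}       # identity -> cached metadata dict

    def register(self, path, inode):
        # Record that `path` is one of the links to the object `inode`.
        self._path_to_inode[path] = inode
        self._metadata.setdefault(inode, {})

    def get(self, path):
        # Resolve the path to its identity first, then look up metadata.
        return self._metadata[self._path_to_inode[path]]


cache = InodeKeyedCache()
# Two hard links to the same data area share one (assumed) identity.
cache.register("/entry/r8_data", inode=1)
cache.register("/link/renLinkData", inode=1)

# A metadata change made through one path...
cache.get("/entry/r8_data")["title"] = "updated"
# ...is visible through the other, because both share one cache entry.
print(cache.get("/link/renLinkData")["title"])
```

With a path-keyed cache, each of the two paths would instead own a separate entry, and the second lookup would return stale metadata.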