On Thursday 31 January 2008, you wrote:
> It looks like pytables already works well with existing hard links. I
> wanted to ask though if there might be any caching issues lurking
> beneath the surface that I should investigate. An example data set is
> attached (created with the nexus library using Paul Kienzle's python
> wrapper) where t.root.entry.r8_data is a 5x4 64bit float array and
> t.root.link.renLinkData is a hard link to the same. I can modify the
> array following either path, and support looks completely
> transparent, but I just wanted to be sure.

I've been talking with Ivan about this, and we have come to the
conclusion that implementing links in PyTables is much hairier than I
anticipated. The problem is mainly one of metadata coherency (there
should be no problems with the data itself, as you have checked; tables,
which have I/O buffers, might have some, but only in rather exceptional
cases).

As you probably know, PyTables puts a lot of effort into caching
metadata in order to speed up access to it (this is why it is so
efficient when handling potentially large hierarchies, see [1]_). The
cached metadata lives basically in three places:

* Node objects
* AttributeSet objects
* Index objects (this is a special case for the Pro version, but
  important to us)

Here are a couple of examples of the kind of problems that can be seen.
Firstly, problems with the node cache:

In [72]: f.root.entry.sample
Out[72]:
/entry/sample (Group) ''
  children := ['ch_data' (Array)]

In [73]: f.root.link.renLinkGroup
Out[73]:
/link/renLinkGroup (Group) ''
  children := ['ch_data' (Array)]
# it's a link to '/entry/sample'

In [74]: new_node=f.createArray(f.root.link.renLinkGroup, "new_array", [1,2])

In [75]: f.root.link.renLinkGroup
Out[75]:
/link/renLinkGroup (Group) ''
  children := ['new_array' (Array), 'ch_data' (Array)]

In [76]: f.root.entry.sample
Out[76]:
/entry/sample (Group) ''
  children := ['ch_data' (Array)]

where you can see that the 'new_array' node is missing in 'sample' (!).
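
The node-cache behaviour above can be mimicked with a toy model. This is
only a sketch and not PyTables code: ToyFile, children and create_child
are made-up names, and the 'disk' is a plain dict. It assumes a metadata
cache keyed by node path, with two paths hard-linked to the same on-disk
object:

```python
class ToyFile:
    """Hypothetical stand-in for a file with a path-keyed node cache."""

    def __init__(self):
        # "On-disk" objects: object address -> list of child names.
        self._disk = {1: ["ch_data"]}
        # Hard links: two different paths resolve to the same object.
        self._links = {"/entry/sample": 1, "/link/renLinkGroup": 1}
        # Metadata cache keyed by node path.
        self._cache = {}

    def children(self, path):
        # Read the child list from "disk" only on a cache miss.
        if path not in self._cache:
            self._cache[path] = list(self._disk[self._links[path]])
        return self._cache[path]

    def create_child(self, path, name):
        cached = self.children(path)  # make sure this path is cached
        self._disk[self._links[path]].append(name)  # write to "disk"
        cached.append(name)  # only the entry for `path` is refreshed


f = ToyFile()
f.children("/entry/sample")       # both paths get cached first
f.children("/link/renLinkGroup")
f.create_child("/link/renLinkGroup", "new_array")

via_link = f.children("/link/renLinkGroup")
via_orig = f.children("/entry/sample")
print(via_link)  # ['ch_data', 'new_array']
print(via_orig)  # ['ch_data']  (stale: 'new_array' is missing)
```

The write reaches the shared object on 'disk', but the cache entry for
the other path is never told about it, which is the same symptom as in
the session above.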

Secondly, problems with the attribute metadata cache:

In [51]: f.root.entry.r8_data.attrs
Out[51]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

In [52]: f.root.link.renLinkData.attrs
Out[52]:
/link/renLinkData._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

In [53]: f.root.link.renLinkData.attrs.userattr = "a test"

In [54]: f.root.link.renLinkData.attrs
Out[54]:
/link/renLinkData._v_attrs (AttributeSet), 5 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data',
    userattr := 'a test']

In [55]: f.root.entry.r8_data.attrs
Out[55]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

Note that the 'userattr' attribute is missing in the 'r8_data' node.

I'll skip the discussion of the problems with indexes, as those already
mentioned are more than enough to make the point.

These metadata cache coherency issues are pretty difficult to solve: we
would need to rethink the whole structure of PyTables to accommodate
them, because the current design was simply not conceived to deal with
this.

An additional problem is that HDF5 seems to allow hard links to Groups,
which makes it possible to create 'loops' in the hierarchy. Of course,
supporting this in PyTables introduces more complexity but, as a first
approximation, we could 'disable' this feature, so I'll skip the
discussion of the issues in that regard.

Going back to the cache coherency problem, a key aspect for solving it
would be how to uniquely determine the data area of a node on disk (i.e.
the equivalent of an 'inode' in a filesystem), and take this identifier
as the new 'primary key' for the node cache (right now, this role is
played by the node path, but this is precisely what introduces the cache
coherency problem).

A possible candidate for this 'primary key' role would be the
'_v_objectID' node attribute, but unfortunately HDF5 returns different
IDs for links pointing to the same 'inode':

In [77]: f.root.entry.sample._v_objectID
Out[77]: 134217739

In [78]: f.root.link.renLinkGroup._v_objectID
Out[78]: 134217748

Ummm, I will ask on the HDF5 mailing list whether it would be possible
to get a unique identifier for all the links pointing to the same data
area. If HDF5 can provide such an identifier, the next step would be to
rethink the structure of the metadata cache in PyTables and implement a
new one based on the 'inode' concept instead of the 'node path' one,
which is certainly not a trivial task (to say the least).

> It looks like rounding out support for hard links would simply
> require adding a new method to File to create the link. I propose
> something like
>
> File.linkNode(self, where, name, curObject)
> or
> File.createLink(self, where, name, curObject)
>
> The argument list here follows the pattern in createTable.

Yeah, I like both. Perhaps the 'createLink' flavor is more consistent
with the 'actionNode' pattern that is used in other constructors.

> Soft links would take more work. I don't think I would use them
> myself, so I probably am the wrong person to suggest their
> implementation. Maybe they would require a new pytables object
> deriving from Leaf, I don't really know how such a thing should
> behave. They could be added later, and be created with the same file
> method through the addition of a linktype kwarg that defaults to hard
> links.

Curiously enough, Ivan and I think that 'soft' links would be far
cheaper to implement in current PyTables than their 'hard' counterparts.
This is because the metadata 'primary keys' in our object tree cache
could continue to be based on 'node paths' instead of 'inodes'. There is
still the problem of metadata cache coherency, but we could think about
maintaining lists of 'soft links' pointing to each 'real' node, so that
we can update them on each modification of the real node. Or perhaps
implementing the 'soft links' using a proxy pattern would be more than
enough. These are ideas off the top of my head, but we should think more
about this anyway.

Well, sorry for not having been able to anticipate so many difficulties
earlier.

.. [1] http://www.carabos.com/downloads/resources/NewObjectTreeCache.pdf

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users