On Thursday 31 January 2008 01:15:46 pm Francesc Altet wrote:
> On Thursday 31 January 2008, you wrote:
> > It looks like pytables already works well with existing hard links. I
> > wanted to ask though if there might be any caching issues lurking
> > beneath the surface that I should investigate. An example data set is
> > attached (created with the nexus library using Paul Kienzle's python
> > wrapper) where t.root.entry.r8_data is a 5x4 64bit float array and
> > t.root.link.renLinkData is a hard link to the same. I can modify the
> > array following either path, and support looks completely
> > transparent, but I just wanted to be sure.
>
> I've been talking with Ivan about this, and we have come to the
> conclusion that implementing links in PyTables is much hairier than
> I anticipated.  The problem is mainly one of metadata coherency (there
> should be no problems with the data itself, as you have checked;
> tables, which have I/O buffers, might have some, but only in rather
> exceptional cases).
>
> As you probably know, PyTables puts a lot of effort into caching
> metadata in order to speed up access to it (this is why it is so
> efficient when handling potentially large hierarchies, see [1]_).  The
> cached metadata is basically found in three places:
>
> * Node objects
> * AttributeSet objects
> * Index objects (this is a special case for the Pro version, but
> important to us).
>
> Here are a couple of examples of the kind of problems that can be
> seen. Firstly, problems with the node cache:
[...]

I didn't think to check these; I see they are quite serious.

> These metadata cache coherency issues are pretty difficult to solve, as
> we would need to rethink the whole structure of PyTables to account for
> them, because the current design was simply not meant to deal with
> this.
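
To make the coherency problem concrete, here is a minimal toy sketch of a node cache keyed by path. It is not PyTables code: `Node`, `PathKeyedFile`, `resize` and the `obj-1` key are hypothetical names invented for illustration, and only the two paths come from the example data set described in this thread.

```python
# Toy model only: none of these classes are PyTables or HDF5 APIs.

class Node:
    """In-memory wrapper that caches one piece of metadata (the shape)."""

    def __init__(self, storage, object_key):
        self._storage = storage
        self._object_key = object_key
        self.shape = storage[object_key]["shape"]  # cached when first loaded

    def resize(self, new_shape):
        # Update the "on-disk" record and this wrapper's cached copy only.
        self._storage[self._object_key]["shape"] = new_shape
        self.shape = new_shape


class PathKeyedFile:
    """Hands out Node wrappers, cached by the path used to reach them."""

    def __init__(self, storage, links):
        self._storage = storage  # object key -> "on-disk" record
        self._links = links      # path -> object key (two paths may share one)
        self._cache = {}         # path -> Node

    def get_node(self, path):
        if path not in self._cache:
            self._cache[path] = Node(self._storage, self._links[path])
        return self._cache[path]


storage = {"obj-1": {"shape": (5, 4)}}
links = {"/entry/r8_data": "obj-1", "/link/renLinkData": "obj-1"}
f = PathKeyedFile(storage, links)

original = f.get_node("/entry/r8_data")
alias = f.get_node("/link/renLinkData")  # a second, independent wrapper
original.resize((10, 4))

stale_shape = alias.shape               # cached copy held by the alias
disk_shape = storage["obj-1"]["shape"]  # what is actually stored
```

Because the cache key is the path, the two links produce two wrappers, and the metadata cached in `alias` no longer matches what is stored after the change made through `original`.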
>
> An additional problem is that HDF5 seems to allow hard links to
> Groups, which introduces the possibility of creating 'loops' in
> the hierarchy.  Of course, supporting this in PyTables introduces more
> complexity, but, as a first approximation, we could 'disable' this
> feature, so I'll skip the discussion of the issues in that regard.
>
> Going back to the cache coherency problem, a key aspect for solving it
> would be how to uniquely determine the data area of a node on disk
> (i.e. the equivalent of an 'inode' in a filesystem), and take this
> identifier as the new 'primary key' for the node cache (right now, this
> role is played by the node path, but this is precisely what introduces
> the cache coherency problem).  A possible candidate for playing
> this 'primary key' role would be the '_v_objectID' node attribute, but
> unfortunately HDF5 returns different IDs for links pointing to the
> same 'inode':
>
> In [77]: f.root.entry.sample._v_objectID
> Out[77]: 134217739
>
> In [78]: f.root.link.renLinkGroup._v_objectID
> Out[78]: 134217748
>
> Hmm, I will ask on the HDF5 mailing list whether it would be possible
> to get a unique identifier for all the links pointing to the same data
> area.  If HDF5 can provide such an identifier, the next step should be
> to rethink the structure of the metadata cache in PyTables and
> implement a new one based on the 'inode' concept instead of the 'node
> path' one, which is certainly not a trivial task (to say the least).
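
The 'inode as primary key' idea can be sketched with a toy cache, assuming that a unique per-object identifier is available (which is exactly the open question for HDF5 here). `InodeKeyedFile`, `Node` and the `inode-7` value are hypothetical and are not PyTables or HDF5 names.

```python
# Toy model only: assumes the storage layer can report one stable
# identifier ("inode") per object, whichever link is used to reach it.

class Node:
    """In-memory wrapper holding cached metadata for one stored object."""

    def __init__(self, record):
        self.shape = record["shape"]


class InodeKeyedFile:
    """Resolves a path to an inode first, then caches wrappers by inode."""

    def __init__(self, storage, links):
        self._storage = storage  # inode -> "on-disk" record
        self._links = links      # path -> inode
        self._cache = {}         # inode -> Node

    def get_node(self, path):
        inode = self._links[path]
        if inode not in self._cache:
            self._cache[inode] = Node(self._storage[inode])
        return self._cache[inode]


storage = {"inode-7": {"shape": (5, 4)}}
links = {"/entry/r8_data": "inode-7", "/link/renLinkData": "inode-7"}
f = InodeKeyedFile(storage, links)

original = f.get_node("/entry/r8_data")
alias = f.get_node("/link/renLinkData")

original.shape = (10, 4)        # change cached metadata through one path
shape_via_alias = alias.shape   # read it back through the other path
cached_wrappers = len(f._cache)
```

Both paths resolve to the same wrapper, so there is only one cached copy of the metadata to keep coherent. The cost is an extra path-to-inode lookup on every access, and the scheme only works if the identifier really is unique per data area.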

I saw your post; hopefully there is some support in HDF5 on which to build.

> > It looks like rounding out support for hard links would simply
> > require adding a new method to File to create the link. I propose
> > something like
> >
> > File.linkNode(self, where, name, curObject)
> > or
> > File.createLink(self, where, name, curObject)
> >
> > The argument list here follows the pattern in createTable.
>
> Yeah, I like both.  Perhaps the 'createLink' flavor is more consistent
> with the 'actionNode' pattern that is used in other constructors.
>
> > Soft links would take more work. I don't think I would use them
> > myself, so I probably am the wrong person to suggest their
> > implementation. Maybe they would require a new pytables object
> > deriving from Leaf; I don't really know how such a thing should
> > behave. They could be added later, and be created with the same file
> > method through the addition of a linktype kwarg that defaults to hard
> > links.
>
> Curiously enough, Ivan and I think that 'soft' links would be far
> cheaper to implement in current PyTables than their 'hard'
> counterparts.  This is due to the fact that metadata 'primary keys' in
> our object tree cache could continue to be based on 'node paths'
> instead of 'inodes'.  Still, there is the problem of metadata cache
> coherency, but we could think about maintaining lists of 'soft links'
> pointing to each 'real' node, so that we can update them on each
> modification of the real node.  Or perhaps, implementing the 'soft
> links' using a proxy pattern would be more than enough.  These are
> ideas off the top of my head, but we should think more about that
> anyway.
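
As a rough illustration of the proxy idea, here is a toy soft link that stores only a target path and resolves it on every attribute access. `SoftLink`, `Array`, the plain-dict `tree` and the `/link/soft` path are hypothetical stand-ins, not a proposed PyTables design.

```python
# Toy model only: a dict keyed by path stands in for the object tree.

class Array:
    """Stand-in for a 'real' node with some metadata."""

    def __init__(self, shape):
        self.shape = shape


class SoftLink:
    """Proxy that keeps no metadata of its own, only the target path."""

    def __init__(self, tree, target_path):
        self._tree = tree
        self._target_path = target_path

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for target attributes.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._tree[self._target_path], name)


tree = {"/entry/r8_data": Array((5, 4))}
tree["/link/soft"] = SoftLink(tree, "/entry/r8_data")

tree["/entry/r8_data"].shape = (10, 4)
shape_via_link = tree["/link/soft"].shape

# Replacing the node at the target path is also picked up by the link.
tree["/entry/r8_data"] = Array((2, 2))
shape_after_replace = tree["/link/soft"].shape
```

Since the proxy caches nothing, the path-keyed cache stays the single source of metadata and no list of back-references has to be maintained. The trade-off is one extra lookup per access, plus the need to decide what a link should do when its target path no longer exists.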
>
> Well, sorry for not being able to anticipate so many difficulties
> earlier.

Not at all. I hope that once we hear back from the hdf5 list, there might be 
some way forward.

Darren

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
