On Thursday 31 January 2008, you wrote:
> It looks like pytables already works well with existing hard links. I
> wanted to ask though if there might be any caching issues lurking
> beneath the surface that I should investigate. An example data set is
> attached (created with the nexus library using Paul Kienzle's python
> wrapper) where t.root.entry.r8_data is a 5x4 64bit float array and
> t.root.link.renLinkData is a hard link to the same. I can modify the
> array following either path, and support looks completely
> transparent, but I just wanted to be sure.

I've been talking with Ivan about this, and we have come to the
conclusion that implementing links in PyTables is much hairier than
I anticipated.  The problem is mainly metadata coherency (there
should be no problems with the data itself, as you have checked;
tables, which have I/O buffers, might have some, but only in rather
exceptional cases).

As you probably know, PyTables puts a lot of effort into caching
metadata in order to speed up access to it (this is why it is so
efficient when handling potentially large hierarchies, see [1]_).  The
cached metadata lives basically in three places:

* Node objects
* AttributeSet objects
* Index objects (this is a special case for the Pro version, but
important to us).

Here are a couple of examples of the kind of problems that can be
seen.  Firstly, problems with the node cache:

In [72]: f.root.entry.sample
Out[72]:
/entry/sample (Group) ''
  children := ['ch_data' (Array)]

In [73]: f.root.link.renLinkGroup
Out[73]:
/link/renLinkGroup (Group) ''
  children := ['ch_data' (Array)]  # it's a link to '/entry/sample'

In [74]: new_node = f.createArray(f.root.link.renLinkGroup, "new_array", [1, 2])

In [75]: f.root.link.renLinkGroup
Out[75]:
/link/renLinkGroup (Group) ''
  children := ['new_array' (Array), 'ch_data' (Array)]

In [76]: f.root.entry.sample
Out[76]:
/entry/sample (Group) ''
  children := ['ch_data' (Array)]

where you can see that the 'new_array' node is missing in 'sample' (!).
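
To make the failure mode concrete, here is a toy model in plain Python.
It is not PyTables code: `ToyFile`, `children` and `create_child` are
made-up names, and the "disk" is just a dict.  It only assumes what is
described above, namely that cached metadata is keyed by node path, so
two hard links to the same object get two independent cache entries.

```python
# Toy model (NOT PyTables code): a metadata cache keyed by node path.
class ToyFile:
    def __init__(self):
        # "On-disk" storage: object id -> list of child names.
        self.disk = {1: ["ch_data"]}
        # Two paths (hard links) resolve to the same on-disk object.
        self.links = {"/entry/sample": 1, "/link/renLinkGroup": 1}
        # Metadata cache keyed by path: path -> cached list of children.
        self.cache = {}

    def children(self, path):
        # Read from "disk" only on the first access through this path.
        if path not in self.cache:
            self.cache[path] = list(self.disk[self.links[path]])
        return self.cache[path]

    def create_child(self, path, name):
        # Write to "disk", then refresh only the entry for this path.
        self.disk[self.links[path]].append(name)
        if path in self.cache:
            self.cache[path].append(name)


f = ToyFile()
f.children("/entry/sample")        # cached under the first path
f.children("/link/renLinkGroup")   # cached again under the second path
f.create_child("/link/renLinkGroup", "new_array")

stale = f.children("/entry/sample")
fresh = f.children("/link/renLinkGroup")
print(stale)   # does not include 'new_array'
print(fresh)   # includes 'new_array'
```

The data on "disk" is correct in both cases; only the cached view
reached through the other path is out of date.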

Secondly, problems with the attribute metadata cache:

In [51]: f.root.entry.r8_data.attrs
Out[51]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

In [52]: f.root.link.renLinkData.attrs
Out[52]:
/link/renLinkData._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

In [53]: f.root.link.renLinkData.attrs.userattr = "a test"

In [54]: f.root.link.renLinkData.attrs
Out[54]:
/link/renLinkData._v_attrs (AttributeSet), 5 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data',
    userattr := 'a test']

In [55]: f.root.entry.r8_data.attrs
Out[55]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
   [ch_attribute := 'NeXus',
    i4_attribute := 42,
    r4_attribute := 3.14159274101,
    target := '/entry/r8_data']

Note that the 'userattr' attribute is missing in the 'r8_data' node.

I'll skip the discussion of the problems with indexes, as the ones
already mentioned are more than enough to make the point.

These metadata cache coherency issues are pretty difficult to solve, as
we would need to rethink the whole structure of PyTables to take them
into account; the current design was simply not conceived to deal
with this.

An additional problem is that HDF5 seems to allow hard links to
Groups, which makes it possible to create 'loops' in the hierarchy.
Of course, supporting this in PyTables introduces more complexity but,
as a first approximation, we could 'disable' this feature, so I'll
skip the discussion of the issues in that regard.
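
Just to illustrate why loops matter for anything that walks the tree,
here is a small sketch in plain Python.  The hierarchy is a made-up
dict of object ids (nothing here comes from HDF5 or PyTables), and
`walk` is a hypothetical helper: it reports every path it finds, but
descends into each underlying object only once, so a link pointing
back to the root does not cause an endless traversal.

```python
def walk(groups, start):
    """Yield (path, object_id) pairs, descending into each object once."""
    visited = set()
    stack = [("/", start)]
    while stack:
        path, obj = stack.pop()
        yield path, obj
        if obj in visited:
            continue  # already expanded via another link: do not loop
        visited.add(obj)
        children = groups.get(obj, {})
        for name, child in sorted(children.items(), reverse=True):
            stack.append((path.rstrip("/") + "/" + name, child))


# Hypothetical hierarchy: object id -> {child name: child object id}.
# '/link/back' is a hard link to the root group, which closes a loop.
groups = {
    0: {"entry": 1, "link": 2},
    1: {"sample": 3},
    2: {"renLinkGroup": 3, "back": 0},
    3: {},
}

paths = [path for path, _ in walk(groups, 0)]
print(paths)
```

Without the `visited` set the walk would follow '/link/back' into the
root again and never finish.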

Going back to the cache coherency problem, a key aspect for solving it 
would be how to uniquely determine the data area of a node on disk 
(i.e. the equivalent of an 'inode' in a filesystem), and take this
identifier as the new 'primary key' for the node cache (right now, this 
role is played by the node path, but this is precisely what introduces 
the cache coherency problem).  A possible candidate for playing 
this 'primary key' role would be the '_v_objectID' node attribute, but 
unfortunately HDF5 returns different IDs for links pointing to the 
same 'inode':

In [77]: f.root.entry.sample._v_objectID
Out[77]: 134217739

In [78]: f.root.link.renLinkGroup._v_objectID
Out[78]: 134217748

Hmm, I will ask on the HDF5 mailing list whether it is possible to get
a unique identifier for all the links pointing to the same data area.
If HDF5 can provide such an identifier, the next step would be to
rethink the structure of the metadata cache in PyTables and implement
a new one based on the 'inode' concept instead of the 'node path' one,
which is certainly not a trivial task (to say the least).
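
As a rough sketch of what an 'inode'-keyed cache could look like (plain
Python, nothing to do with the real PyTables internals): `resolve`
below is a hypothetical function mapping a path to a unique object
key, which is exactly the piece we do not know how to get from HDF5
yet, so here it is faked with a dict.

```python
class InodeKeyedCache:
    """Toy metadata cache keyed by a unique object key, not by path."""

    def __init__(self, resolve):
        self._resolve = resolve  # hypothetical: path -> unique object key
        self._meta = {}          # object key -> metadata dict

    def get(self, path):
        key = self._resolve(path)
        # All paths resolving to the same key share one metadata entry.
        return self._meta.setdefault(key, {"attrs": {}})


# Faked resolver: both paths are hard links to the same object.
links = {"/entry/r8_data": "obj-A", "/link/renLinkData": "obj-A"}
cache = InodeKeyedCache(links.__getitem__)

cache.get("/link/renLinkData")["attrs"]["userattr"] = "a test"
seen = cache.get("/entry/r8_data")["attrs"]
print(seen)   # the attribute set through the link is visible here too
```

With this layout the attribute example shown earlier cannot go stale,
because there is only one cache entry per object no matter how many
links point to it.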

> It looks like rounding out support for hard links would simply
> require adding a new method to File to create the link. I propose
> something like
>
> File.linkNode(self, where, name, curObject)
> or
> File.createLink(self, where, name, curObject)
>
> The argument list here follows the pattern in createTable.

Yeah, I like both.  Perhaps the 'createLink' flavor is more consistent 
with the 'actionNode' pattern that is used in other constructors.

> Soft links would take more work. I don't think I would use them
> myself, so I probably am the wrong person to suggest their
> implementation. Maybe they would require a new pytables object
> deriving from Leaf, I don't really know how such a thing should
> behave. They could be added later, and be created with the same file
> method through the addition of a linktype kwarg that defaults to hard
> links.

Curiously enough, Ivan and I think that 'soft' links would be far
cheaper to implement in current PyTables than their 'hard'
counterparts.  This is because the metadata 'primary keys' in
our object tree cache could continue to be based on 'node paths'
instead of 'inodes'.  The problem of metadata cache coherency would
still be there, but we could think about maintaining a list of the
'soft links' pointing to each 'real' node, so that we can update them
on each modification of the real node.  Or perhaps implementing the
'soft links' using a proxy pattern would be more than enough.  These
are ideas off the top of my head, and we should think more about this
anyway.
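
To give an idea of the proxy approach, here is a minimal sketch in
plain Python.  `Node`, `SoftLink` and the `registry` dict are made-up
names for illustration only, not PyTables classes.  The link stores
nothing but the target path and forwards every attribute access to the
real node, so the cache can stay keyed by path and there is no second
copy of the metadata to keep in sync.

```python
class Node:
    """Stand-in for a 'real' node with cached metadata."""

    def __init__(self, path):
        self.path = path
        self.attrs = {}


class SoftLink:
    """Proxy that forwards attribute access to the node at a target path."""

    def __init__(self, registry, target):
        self._registry = registry  # hypothetical path -> Node mapping
        self._target = target      # path of the real node

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for anything the
        # proxy does not define itself.  The target is resolved on
        # every access, so the link never holds stale metadata.
        return getattr(self._registry[self._target], name)


registry = {"/entry/r8_data": Node("/entry/r8_data")}
link = SoftLink(registry, "/entry/r8_data")

link.attrs["userattr"] = "a test"
print(registry["/entry/r8_data"].attrs)  # the change landed on the real node
print(link.path)                         # forwarded to the real node as well
```

A dangling link (target path not in `registry`) would raise a KeyError
on access in this sketch; how a real implementation should behave in
that case is an open question.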

Well, sorry for not having anticipated so many difficulties
earlier.

.. [1] http://www.carabos.com/downloads/resources/NewObjectTreeCache.pdf

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
