Hi Francesc,

thanks a lot for your help with this.
I decided to go with the following approach:
Every Particle only ever has three daughters, each of which looks different from the parent, so the best structure to use is a nested tuple.
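As a plain-Python sketch of what I mean by a nested layout (hypothetical names and placeholder types; only 'baryon1'/'baryon2' and 'uid' correspond to the columns mentioned in the traceback below), nested descriptions end up being addressed by 'parent/child' paths:

```python
# Hypothetical sketch (plain Python, no PyTables needed): a nested
# description flattened to the 'parent/child' column paths that
# Table.col() expects. Leaf types are placeholder strings.

def flatten(desc, prefix=""):
    """Flatten a nested dict of column types into 'a/b' path names."""
    cols = {}
    for name, value in desc.items():
        path = prefix + name
        if isinstance(value, dict):      # nested sub-description
            cols.update(flatten(value, path + "/"))
        else:                            # leaf column type
            cols[path] = value
    return cols

particle = {
    "uid": "Int32",
    "baryon1": {"uid": "Int32", "momentum": "Float64[3]"},
    "baryon2": {"uid": "Int32", "momentum": "Float64[3]"},
}

print(sorted(flatten(particle)))
```

This is only meant to show the path naming; the real description in X.py uses PyTables column classes.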
Now to the actual problem:
Please have a look at the attached file. Running tableBug.py causes the following error on my machine:
Traceback (most recent call last):
  File "tableBug.py", line 8, in ?
    r.getTable().col('baryon2/uid').ndim
  File "/u1/local/lib/python2.4/site-packages/tables/Table.py", line 1330, in col
    return self.read(field=name)
  File "/u1/local/lib/python2.4/site-packages/tables/Table.py", line 1160, in read
    self._checkColumn(field)
  File "/u1/local/lib/python2.4/site-packages/tables/Table.py", line 711, in _checkColumn
    raise KeyError(
KeyError: 'table ``/events/B_candidates`` does not have a column named ``baryon2/uid``'
Closing remaining opened files...  x.h5... done.

Now the behavior of this is very strange:
a) Removing the pos= statements in X.py results in no crash.
b) baryon1 and baryon2 are of the same type, but baryon1/uid exists while baryon2/uid does not.
c) Removing the two statements that access the ndim of the column (i.e. doing just the filling) results in no crash, at least here.

For now I am going with solution a) to avoid the crash, but I'm afraid something bad is happening underneath.

I am using PyTables 1.3.2 with HDF5 1.6.5.

I hope you can clarify this.

Thanks,
   Jan




Francesc Altet wrote:
Hello Jan,

On Friday, 11 August 2006 at 04:22, Jan Strube wrote:
I would like to follow up on the message that was posted here last month:
http://sourceforge.net/mailarchive/forum.php?thread_id=19437345&forum_id=13760

My problem is a bit different from the one presented there:

I have data in the following structure, representing a reconstructed particle decay:

Event
        List of Particles (variable number of entries in each event)
                3 daughters
                        momentum
                momentum
                other vectorial and scalar quantities

I hope the idea is clear from this clumsy description.

Well, it is not clear to me whether you will always have 3 daughters or can have more (or fewer). Anyway, from what you explain later on, I'll assume that you can have a variable number of daughters.

From the example in the previous thread I understand how I can model _one_ particle. However, I would like to be able to group the particles by event.
The reason is that there can be only one particle per event, but I have
reconstructed many and don't know which one is the right one.

So I can think of two ways to implement this:
1) Simply do the same thing as in the previous thread, but add something
like an event_id to the particle.
2) Since persistence of an array of arbitrary classes is not supported,
I could unroll every member in the event, so that I
  a) have to find out the largest possible number of particles in an
event
  b) create for each member a FloatCol(shape=max)
In that case the event would look like this:
Event
        Particle_daughter_momentum_x = (list of max entries)
        Particle_daughter_momentum_y = (list of max entries)
        Particle_daughter_id = (list of max entries)
        Particle_mass = (list of max entries)
        Particle_daughter_mass = (list of max entries)
and so on and so forth for every member. max is then the highest number
of particles in any event.

There are other possibilities, such as defining external EArrays (you can put them in the same group) that record which table entries belong to the same event. This adds a bit of complexity to your design, though.
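As a rough plain-Python sketch of that idea (the names are illustrative, not PyTables API): alongside the particle table, keep two small arrays recording where each event's rows start and how many there are, so that the rows of one event can be sliced out directly:

```python
# Plain-Python stand-ins: 'particles' plays the role of the Table,
# 'event_start'/'event_nrows' play the role of external EArrays.

particles = []      # flat list of particle rows, events stored contiguously
event_start = []    # first row number of each event
event_nrows = []    # number of rows in each event

def append_event(event_rows):
    """Append one event's particle rows and record its row range."""
    event_start.append(len(particles))
    event_nrows.append(len(event_rows))
    particles.extend(event_rows)

def rows_for_event(event_no):
    """Slice out one event's rows without scanning the whole table."""
    start, n = event_start[event_no], event_nrows[event_no]
    return particles[start:start + n]

append_event([{"mass": 0.5}, {"mass": 1.2}])
append_event([{"mass": 3.1}])
print(rows_for_event(1))   # -> [{'mass': 3.1}]
```

The trade-off is exactly the extra bookkeeping: the index arrays must be kept in sync with the table on every append.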

1) has the advantage that it's easy to implement, but in principle I
would have to read over the whole dataset just to get the particles in
_one_ event. (I am thinking something
like: ...where(event_id==current_event) )

2) has the advantage that everything in the event is really grouped together, but I am wondering i) about the waste of space, because I am allocating room for the maximum number of particles in every event; and ii) whether the information is really useful in that shape; I would probably have to write something like an adaptor class to put the information that belongs together back together.

Well, this is a nice problem indeed! After considering the best approach for a while, I'd definitely go with option 1). In addition, to reduce the time needed to access a certain event, you can use the indexing capabilities of PyTables (to avoid traversing the entire table when looking up the interesting event). I've implemented some code (attached) to check the implementation and also to assess the speed-up that you can expect from indexing. I used a system with a Pentium 4 @ 2 GHz to obtain all the figures below.

Regarding the implementation, I've defined a table with the following structure:

# Particle description
class Particle(IsDescription):
    event_id    = IntCol()          # event id
    particle_id = IntCol()          # particle id in the event
    parent_id   = IntCol()          # id of the parent particle
                                    # (negative value means no parent)
    momentum    = FloatCol(shape=3) # momentum of the particle
    mass        = FloatCol()        # mass of the particle

where you can see the 'event_id' field that you suggested in 1). Also, note that I've added a 'parent_id' column that tells whether the particle is a primary one (negative value) or derived from another (positive value). I think this is a quite flexible (yet very straightforward) structure for dealing with your problem.
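A small plain-Python illustration of the 'parent_id' convention (the sample rows below are made up): the decay tree of one event can be rebuilt from the flat rows simply by grouping particle ids under their parent id:

```python
# Made-up rows for one event, following the (particle_id, parent_id)
# convention: a negative parent_id marks a primary particle.
rows = [
    {"particle_id": 0, "parent_id": -1},   # primary particle
    {"particle_id": 1, "parent_id": 0},    # daughter of 0
    {"particle_id": 2, "parent_id": 0},    # daughter of 0
    {"particle_id": 3, "parent_id": 1},    # granddaughter
]

# Group children under each parent id to recover the tree.
children = {}
for row in rows:
    children.setdefault(row["parent_id"], []).append(row["particle_id"])

roots = children.get(-1, [])   # primaries hang off the sentinel -1
print(roots, children.get(0))  # -> [0] [1, 2]
```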

After filling a table with approximately 1 million particles (20000 events with a maximum of 100 particles each, the actual number varying from event to event), I defined a series of operations for processing the data. One example is "determining the sum of the modulus of momentum for the daughter particles of particle X in event Y". Here is the code for it:

from numarray import sqrt, add      # PyTables 1.x is numarray-based
event_col = table.cols.event_id     # column object used in the condition

smomentum = 0.0
for row in table.where(event_col == 34):
    if row['parent_id'] == 3:
        smomentum += sqrt(add.reduce(row['momentum']**2))
print smomentum

or, if you like generator expressions (this needs Python 2.4):

print sum(sqrt(add.reduce(row['momentum']**2))
          for row in table.where(event_col == 34)
          if row['parent_id'] == 3)
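For clarity, here is the same selection expressed over plain Python dicts (with made-up sample rows, independent of PyTables), showing exactly what the query computes:

```python
from math import sqrt

# Made-up sample rows; momenta chosen so the moduli are easy to check.
rows = [
    {"event_id": 34, "parent_id": 3, "momentum": (1.0, 2.0, 2.0)},  # |p| = 3
    {"event_id": 34, "parent_id": 3, "momentum": (0.0, 3.0, 4.0)},  # |p| = 5
    {"event_id": 34, "parent_id": 7, "momentum": (1.0, 0.0, 0.0)},  # other parent
    {"event_id": 35, "parent_id": 3, "momentum": (6.0, 0.0, 8.0)},  # other event
]

# Sum of |p| over daughters of particle 3 in event 34.
smomentum = sum(sqrt(sum(p * p for p in row["momentum"]))
                for row in rows
                if row["event_id"] == 34 and row["parent_id"] == 3)
print(smomentum)   # -> 8.0
```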

With this, I've run a series of benchmarks. First, without using indexes:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 55.406 sec
Done --- Time: 55.487 seconds
Selecting events...
Particles in event 34: 17
Done --- Time: 8.903 seconds
Root particles in event 34: 10
Done --- Time: 2.533 seconds
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 2.255 seconds
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 2.268 seconds
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 2.568 seconds

As you can see, looking up event 34 takes quite a lot of time (~9 s) the first time you access it. After the first access, successive operations on the same event see a speed-up of more than 4x. This is due to caching at several levels (the OS, HDF5, and PyTables itself).

When indexing the 'event_id' column, that is, replacing the line:

event_id    = IntCol()
by
event_id    = IntCol(indexed=True)

in the description, we get:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 57.756 sec
Selecting events...
Particles in event 34: 17
Done --- Time: 0.385 sec
Root particles in event 34: 10
Done --- Time: 0.16 sec
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 0.158 sec
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 0.161 sec
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 0.256 sec

So you get an additional speed-up of more than 10x over the former case. This is normally more than fast enough for interactive use (the response feels almost immediate, especially after the first iteration).
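To see why the index helps so much, here is a toy plain-Python analogue (not the actual PyTables index implementation): build a mapping from event_id to row numbers once, and every subsequent lookup touches only the matching rows instead of scanning the whole table:

```python
# Toy table: 20 rows spread over 5 events (event_id cycles 0..4).
rows = [{"event_id": i % 5, "mass": float(i)} for i in range(20)]

# One-time index build (PyTables does something analogous on disk
# when a column is declared with indexed=True).
index = {}
for rowno, row in enumerate(rows):
    index.setdefault(row["event_id"], []).append(rowno)

# Indexed lookup: cost proportional to the number of matches,
# not to the size of the table.
selected = [rows[i] for i in index[3]]
print(len(selected))   # -> 4
```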

Just out of curiosity, and in case more speed is needed (for batch processes, for example): PyTables Pro (still in alpha) can push these times down even further:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 62.018 sec
Selecting events...
Particles in event 34: 17
Done --- Time: 0.059 sec
Root particles in event 34: 10
Done --- Time: 0.004 sec
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 0.003 sec
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 0.003 sec
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 0.004 sec

That represents an additional ~10x to ~50x improvement over indexing in standard PyTables.

Anyway, I hope this convinces you that indexing is your friend: an effective and simple way to accelerate your lookups (and hence simplify your data schemas in many situations).

Cheers,


--
Jan F. Strube -- University of Oregon
Stanford Linear Accelerator Center
bldg. 48, rm. 244, MS 35
(650) 926-2913 phone
          8522 fax


Attachment: tablebug.tar
Description: Unix tar archive

_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
