Hello Jan,

On Friday 11 August 2006 04:22, Jan Strube wrote:
> I would like to follow up on the message that was posted last month here
> http://sourceforge.net/mailarchive/forum.php?thread_id=19437345&forum_id=13760
>
> My problem is a bit different from the one presented there:
>
> I have data in the following structure, representing a reconstructed
> particle decay
>
> Event
>       List of Particles (variable number of entries in each event)
>               3 daughters
>                       momentum
>               momentum
>               other vectorial and scalar quantities
>
> I hope the idea is clear from this clumsy description.

Well, it is not clear to me whether you will always have 3 daughters or can have 
more (or fewer). Anyway, from what you explain later on, I'll assume that you 
can have a variable number of daughters.

> From the example in the previous thread I understand how I can model
> _one_ particle. However, I would like to be able to group the particles
> by event.
> The reason is that there can be only one particle per event, but I have
> reconstructed many and don't know which one is the right one.
>
> So I can think of two ways to implement this:
> 1) Simply do the same thing as in the previous thread, but add something
> like event_id to the particle.
> 2) Since persistence of an array of arbitrary classes is not supported,
> I could unroll every member in the event, so that I
>   a) have to find out the largest possible number of particles in an
> event
>   b) create for each member a FloatCol(shape=max)
> In that case the event would look like this:
> Event
>       Particle_daughter_momentum_x = (list of max entries)
>       Particle_daughter_momentum_y = (list of max entries)
>       Particle_daughter_id = (list of max entries)
>       Particle_mass = (list of max entries)
>       Particle_daughter_mass = (list of max entries)
> and so on and so forth for every member. max is then the highest number
> of particles in any event.

There are other possibilities, like defining external EArrays (you can put 
them in the same group) that record which entries in the table belong to the 
same event. This adds a bit of complexity to your design, though.
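
Just to illustrate that idea, here is a minimal sketch (not taken from the 
attached script): an EArray next to the table that stores, for each event, the 
(start, stop) range of table rows belonging to it. The file and node names are 
made up, and the createEArray()/Atom signature is from memory (newer PyTables 
versions take a separate 'shape' argument), so adapt it to your version:

from tables import openFile, Int64Atom

h5file = openFile("particles.h5", mode="a")      # hypothetical file name

# One (start, stop) pair per event: event i owns rows start <= n < stop of the
# particle table (this assumes the table is written event by event).  The
# extendable dimension is given through the Atom shape here; newer versions
# pass the shape as a separate argument to createEArray().
ranges = h5file.createEArray(h5file.root, 'event_ranges',
                             Int64Atom(shape=(0, 2)),
                             "row ranges per event")

# While filling the particle table, append one range per event:
#     ranges.append([[first_row, last_row + 1]])
# Reading back all particles of event 34 then becomes a single slice:
#     start, stop = ranges[34]
#     particles = table.read(start=start, stop=stop)

h5file.close()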

> 1) has the advantage that it's easy to implement, but in principle I
> would have to read over the whole dataset just to get the particles in
> _one_ event. (I am thinking something
> like: ...where(event_id==current_event) )
>
> 2) has the advantage that everything in the event is really grouped
> together, but I am wondering i) about the waste of space, because I am
> allocating the maximum number of particles for each event; ii) the
> information is not really useful in the shape that I envision and I
> would probably have to write something like an adaptor class to put the
> information that belongs together back together.

Well, this is a nice problem indeed! After considering the best approach 
for a while, I'd definitely go for option 1). Also, in order to reduce the 
time needed to access a certain event, you can use the indexing capabilities of 
PyTables (to avoid traversing the entire table in order to look up the 
interesting event). I've implemented some code (attached) in order to check 
the implementation and also to assess the speed-up that you can expect from 
using indexing. I've used a system with a Pentium 4 @ 2 GHz processor for 
getting all the figures below.

Regarding implementation, I've defined a table with the following structure:

from tables import IsDescription, IntCol, FloatCol

# Particle description
class Particle(IsDescription):
    event_id    = IntCol()            # event id
    particle_id = IntCol()            # particle id within the event
    parent_id   = IntCol()            # id of the parent particle
                                      # (a negative value means no parent)
    momentum    = FloatCol(shape=3)   # momentum of the particle
    mass        = FloatCol()          # mass of the particle

where you can notice the 'event_id' field that you suggested in 1). Also, note 
that I've added a 'parent_id' column that tells whether the particle is a 
primary one (negative value) or derived from another one (positive value). I think 
this is a quite flexible (yet very straightforward) structure for dealing 
with your problem.
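
To make this concrete, here is a minimal sketch (with toy random data) of how 
such a table can be created and filled; the file name is made up for the 
example, and the real work is done in the attached particles.py:

from random import randint, random, uniform
from tables import openFile

# Uses the Particle description defined above.
h5file = openFile("particles.h5", mode="w")        # hypothetical file name
table = h5file.createTable(h5file.root, "particles", Particle,
                           "reconstructed particles")

row = table.row
for event_id in range(100):                        # toy data: 100 events
    for pid in range(randint(1, 10)):              # variable particles per event
        row['event_id'] = event_id
        row['particle_id'] = pid
        row['parent_id'] = randint(-1, pid - 1)    # negative means no parent
        row['momentum'] = [uniform(-1.0, 1.0) for i in range(3)]
        row['mass'] = random()
        row.append()
table.flush()
h5file.close()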

After filling a table with approximately 1 million particles (20000 events with a 
maximum of 100 particles each, this number changing from event to event), 
I've defined a series of operations for processing the data. One example 
is "determining the sum of the modulus of momentum for the daughter particles of 
particle X in event Y". Here is the code for it:

# 'event_col' refers to the event_id column (e.g. event_col = table.cols.event_id);
# sqrt and add come from the numerical package used by the script.
smomentum = 0.0
for row in table.where(event_col == 34):
    if row['parent_id'] == 3:
        smomentum += sqrt(add.reduce(row['momentum']**2))
print smomentum

or, if you prefer generator expressions (this requires Python 2.4):

print sum(sqrt(add.reduce(row['momentum']**2))
          for row in table.where(event_col == 34)
          if row['parent_id'] == 3)
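
The other queries in the benchmark follow the same pattern. For instance, the 
sum of masses of the root particles in event 34 (those with a negative 
parent_id) could be written like this (my own sketch; the attached script has 
its own version):

# Root (primary) particles are those with a negative parent_id.
print sum(row['mass']
          for row in table.where(event_col == 34)
          if row['parent_id'] < 0)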

With this, I've run a series of benchmarks. First, without using indexes:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 55.406 sec
Done --- Time: 55.487 seconds
Selecting events...
Particles in event 34: 17
Done --- Time: 8.903 seconds
Root particles in event 34: 10
Done --- Time: 2.533 seconds
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 2.255 seconds
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 2.268 seconds
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 2.568 seconds

As you can see, looking up event 34 takes quite a lot of time (~9 sec) the 
first time that you access it. After the first access, successive 
operations on the same event enjoy a speed-up of more than 4x. This is due 
to caching (at the OS, HDF5 and PyTables levels).

When indexing the 'event_id' column, that is, replacing the line:

event_id    = IntCol()

by

event_id    = IntCol(indexed=True)

in the description (spelled out in full right after the timings below), we get:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 57.756 sec
Selecting events...
Particles in event 34: 17
Done --- Time: 0.385 sec
Root particles in event 34: 10
Done --- Time: 0.16 sec
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 0.158 sec
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 0.161 sec
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 0.256 sec
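
For reference, the full indexed description then reads as follows; if your 
table already exists, you may also be able to index the column afterwards with 
Column.createIndex(), but please check that your PyTables version supports it:

class Particle(IsDescription):
    event_id    = IntCol(indexed=True)   # event id (indexed for fast lookups)
    particle_id = IntCol()               # particle id within the event
    parent_id   = IntCol()               # id of the parent particle
                                         # (a negative value means no parent)
    momentum    = FloatCol(shape=3)      # momentum of the particle
    mass        = FloatCol()             # mass of the particle

# Alternative for an existing table (check availability in your version):
# table.cols.event_id.createIndex()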

So you get an additional speed-up of more than 10x over the former case. 
This speed is normally more than enough for interactive purposes (the 
response is almost immediate for humans, especially after the first 
iteration).

Just out of curiosity, and in case more speed is needed (for batch processes, for 
example), PyTables Pro (still in alpha) can push these times down even 
further:

Creating table with 1000000 entries aprox.. Wait please...
Added 988827 entries --- Time: 62.018 sec
Selecting events...
Particles in event 34: 17
Done --- Time: 0.059 sec
Root particles in event 34: 10
Done --- Time: 0.004 sec
Sum of masses of root particles in event 34: 5003.29466721
Done --- Time: 0.003 sec
Sum of masses of daughter particles for particle 3 in event 34: 489.220871968
Done --- Time: 0.003 sec
Sum of module of momentum for particle 3 in event 34: 6.38388912094
Done --- Time: 0.004 sec

This represents an additional ~10x to ~50x improvement over indexing in 
standard PyTables.

Anyway, I hope this will convince you that indexing is your friend: an 
effective and simple way to accelerate your look-ups (and hence simplify 
your data schemas in many situations).

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

Attachment: particles.py
Description: application/python
