On 11/2/2010 2:45 PM, Christopher Barker wrote:
On 11/2/10 6:09 AM, Wright, Bruce wrote:
Sorry for a late follow-up (and once again breaking the thread), but
below is some feedback from our guys running the particle trajectory
models at the Met Office, which I think highlights the difficulties of
storing particle trajectories efficiently.

Thanks for the comments -- this supports some of the conclusions I had been
coming to:

In a long (multi-year) air quality or risk assessment run, the total
number of particles followed could be a thousand times the maximum
number existing at any one time ... That suggests that
padding out arrays to the total number of particles is not a sensible
option.

Agreed, I've decided that that's not the way to go.

... in
that it links particles arbitrarily according to whether they reuse the
same space).

Right -- that really isn't an option -- yes, the storage space can be re-used,
but then a given slot in the array wouldn't correspond to any particular particle.

An alternative is, at each time, to store the particle data and for this
to include a particle id, without attempting to link particles at
different times.

I think this is the way to go. In fact, I think the particle ID could be 
optional -- some applications don't keep an ID, and most post-processing
doesn't care about the ID. However, an ID could be handy for linking particle
properties that might be constant over time, but vary among particles, rather 
than storing the property over and over again.
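To make that concrete, here's a very rough sketch of the sort of structure I'm
picturing, using the netCDF4-python package. The file name, variable names,
and the "radius" property are all made up for illustration, not a proposed
convention:

    from netCDF4 import Dataset

    nc = Dataset("particles.nc", "w")  # hypothetical file name

    # one record per particle per time step; "obs" just grows as records are written
    nc.createDimension("time", None)
    nc.createDimension("obs", None)

    time = nc.createVariable("time", "f8", ("time",))
    # how many particle records belong to each time step
    particle_count = nc.createVariable("particle_count", "i4", ("time",))

    particle_id = nc.createVariable("particle_id", "i4", ("obs",))  # optional
    lon = nc.createVariable("lon", "f4", ("obs",))
    lat = nc.createVariable("lat", "f4", ("obs",))

    # a per-particle property that is constant in time can be stored once,
    # indexed by particle id, instead of being repeated in every record
    nc.createDimension("particle", None)
    radius = nc.createVariable("radius", "f4", ("particle",))

    nc.close()

The per-time-step counts are what make "give me all the particles at time t"
cheap, without any cross-time linking of records.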

However, retrieving a trajectory is then difficult, as one will have to search
each time slice for the particle id required.

Yes, it would. My thought is that this is an OK price to pay. In models that
create and destroy particles, the trajectory of an individual particle is 
generally not of interest. Far more common is wanting to know about the 
collection of particles at a given time, so that's what should be easy to 
extract.
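For example, with the counts variable from the sketch above, getting everything
at one time step is just a contiguous slice -- again, just a sketch, assuming
records are written in time order:

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("particles.nc")
    counts = nc.variables["particle_count"][:]

    step = 42                              # the time step we want (0-based index)
    start = int(np.sum(counts[:step]))     # first record for this step
    stop = start + int(counts[step])       # one past the last record

    lons = nc.variables["lon"][start:stop]
    lats = nc.variables["lat"][start:stop]
    ids = nc.variables["particle_id"][start:stop]
    nc.close()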

Storing
start and end time for each particle id would help, but restoring a
complete trajectory would still be inefficient. One can think of ways
round this: in a computer language one would have an array for each
particle id giving the indices in each time slice corresponding to the
particle (these arrays could be offset relative to the particle start
time so they would not have to be very long), and then an array of such
structures, one for each particle id. Can NetCDF do that?

Maybe, but the data can be re-constructed, so I wouldn't bother. Yes, it would 
require reading the whole file for one particle's trajectory, but I don't think
that's a common use case -- am I wrong? Are folks likely to want to extract a
particular particle's trajectory from a big data set?
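Pulling out one particle would then be a brute-force scan, something like this
(a sketch, same made-up layout as above) -- read everything and keep the records
that match:

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("particles.nc")
    ids = nc.variables["particle_id"][:]
    keep = ids == 17                       # the particle id we're after

    traj_lon = nc.variables["lon"][:][keep]
    traj_lat = nc.variables["lat"][:][keep]

    # recover which time step each kept record came from
    counts = nc.variables["particle_count"][:]
    step_of_record = np.repeat(np.arange(len(counts)), counts)
    traj_steps = step_of_record[keep]
    nc.close()

Slow for big files, sure, but that's exactly the trade-off I'm suggesting we
accept.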

To make things more difficult it might also be useful to store
trajectories with different length time-steps for different
trajectories.

So some particles are using a larger time step than others? This gets a bit
ugly, yes, and I can't think of a use case either. I suppose it's possible that
a model could use smaller time steps for particles that are in regions with 
faster-changing or more complex current fields, but does any model do this? If 
so, I'd imagine it would be a sub-timestep process (like the intermediate results
in a Runge-Kutta integrator), and you wouldn't need/want to store the smaller steps
anyway.

For very long runs, one would probably not want to be forced to store
everything in one very large file.

yup. I don't think that's hard to accommodate.

I think it would be acceptable to have more than one format for storing
data with different methods being efficient for different retrieval
types, together with (slow) utilities for converting between these
formats. Indeed that might be preferable if it enables things to be kept
simple conceptually.

Maybe, but it seems we can come up with one format that fits the needs of
everyone who has spoken up here, so that's a reasonable start.

-Chris


1) It seems clear that at each time step, you need to write out the data for 
whatever particles currently exist. I assume that if you wanted to break up the 
data for a very long run, you would partition by time, i.e. time steps 1-1000,
1001-2000, etc. would be in separate files (see the sketch after point 3).

2) Apparently the common read pattern is to retrieve the set of particles at a 
given time step. If so, that makes things easier.

3) I assume that you want to be able to figure out an individual particle's 
trajectory, even if that doesn't need to be optimized for speed.
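On point 1, the partitioning by time would be simple to handle on the read
side. A rough sketch, with a made-up file-naming scheme (not a proposal), of
picking which file holds a given step:

    def file_for_step(step, steps_per_file=1000):
        # e.g. step 1500 -> "particles_1001-2000.nc" (hypothetical naming)
        first = ((step - 1) // steps_per_file) * steps_per_file + 1
        last = first + steps_per_file - 1
        return "particles_%04d-%04d.nc" % (first, last)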

Questions:
0) Are those reasonable assumptions?
1) Is the average number of particles (Navg) that exist at any one time much
smaller than, or about the same as, the maximum number (Nmax) that exist at one
time step?
2) How much data is associated with each particle at a given time step? (Just an
estimate is needed here -- 10 bytes? 1000 bytes?)
