[Numpy-discussion] Yet another axes naming package [Was: Re: BOF notes: Fernando's proposal: NumPy ndarray with named axes]

Lluís Tue, 06 Jul 2010 06:04:01 -0700

Jonathan March writes:

> Fernando Perez proposed a NumPy enhancement, an ndarray with named axes,
> prototyped as DataArray by him, Mike Trumpis, Jonathan Taylor, Matthew
> Brett, Kilian Koepsell and Stefan van der Walt.


I haven't had a thorough look into it, but this work as well as others listed in
the 'NdarrayWithNamedAxes' wiki page are similar in spirit to some numpy
extensions I've been developing.

You can find the code and some initial documentation at:
    https://people.gso.ac.upc.edu/vilanova/doc/sciexp2

I was not planning to announce it until around 1.0, as the numpy structures are
still crude and lack some operations for dynamically extending the structure
both in shape and the number of fields on each record (I have some fixes that
still need to be committed), but after seeing some related announcements lately,
I think we all might benefit from trying to join ideas and efforts.

I'll try to briefly explain, with an example, the part that is related to numpy
(that is, the third frontend that appears in the "User Guide": 'plotter', which
currently has documentation that is worse than poor).

Suppose you have a set of benchmarks that have been simulated with different
simulator parameters, such that you have one result file for each executed
combination of the "variables":
  * benchmark
  * parameter1
  * parameter2

Of course, for each execution you'll also have multiple results (what I call
"valuenames"; simply fields in a record array, in fact).

NOTE: scripts for such executions can be generated with the first frontend
      ('launchgen').

Then you can find and extract those results (package 'sciexp2.gather') and
organize them into an N-dimensional 'Data' object (package 'sciexp2.data'),
where the first dimension has (for example) the combinations of
"parameter1-parameter2" values, and the 2nd dimension contains one element for
each benchmark (method 'sciexp2.data.Data.reshape').

Now, you can index/slice the structure with integers (as always) _as well as_
with:
  * strings: simple indexing as well as slicing
  * "filters": slicing with a stepping

These are translated into integers through the "metadata" (benchmark name and/or
values of the 2 parameters), stored in 'sciexp2.data.Dimension' objects.

For example, to get the numbers of tests where parameter1 is between 10 and 100
and just for benchmarks named 'bench1' and 'bench2':

           data[::"10 < parameter1 && parameter1 < 100",["bench1", "bench2"]]
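The translation from metadata to integer indexes can be illustrated with plain
numpy. This is only a minimal sketch of the idea, not the sciexp2
implementation: the arrays 'param1' and 'benchmarks' and the helpers
'select_by_filter' and 'select_by_name' are hypothetical names made up for this
example, and the filter is written as a Python callable instead of the string
syntax shown above.

```python
import numpy as np

# Hypothetical per-axis metadata: one entry per position on each axis.
param1 = np.array([5, 20, 50, 200])          # axis 0: parameter1 values
benchmarks = ["bench1", "bench2", "bench3"]  # axis 1: benchmark names

# Stand-in for the results; 4 parameter1 values x 3 benchmarks.
data = np.arange(12).reshape(4, 3)


def select_by_filter(values, predicate):
    """Return the integer positions whose metadata satisfies the predicate."""
    return np.flatnonzero(predicate(values))


def select_by_name(names, wanted):
    """Return the integer positions of the requested labels, in request order."""
    return np.array([names.index(w) for w in wanted])


rows = select_by_filter(param1, lambda p: (10 < p) & (p < 100))
cols = select_by_name(benchmarks, ["bench1", "bench2"])

# np.ix_ builds an open mesh, so rows and cols select a rectangular block
# instead of being paired element by element.
subset = data[np.ix_(rows, cols)]
print(subset)
```

Keeping the label-to-integer lookup separate from the array itself means the
underlying ndarray is still indexed with ordinary integers in the end.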


There is a third package extending matplotlib that I have not uploaded (nor
fully developed) that is meant to use the dimension and record metadata in the
Data object, such that data can be easily plotted.

It extracts labels for axes and legends from the metadata, and can "expand"
operations. For example:
  * Plot one figure for each benchmark, simply by declaring the figure to be
    "expanded" through the 'benchmark' variable.
  * Plot multiple lines/bars/whatever with a single plot command, like "plot
    such and such for each benchmark", or "plot such and such for each
    configuration and cluster by benchmark name".

More extensive examples can be seen on the following URL, which is from a much
older version that wasn't using numpy nor matplotlib, and provided a somewhat
functional API (SIZE, CPREFETCH, RPREFETCH and SIMULATOR are execution
parameters in these examples; fun starts at line 78):
  
https://projects.gso.ac.upc.edu/projects/sciexp2/repository/revisions/200/entry/progs/sciexp2/tags/0.5/plotter/examples/01-spec-figures.cfg


Finally, some things that have been bugging me about numpy are:

  * My 'Data' object is similar to a 'recarray', such that record elements
    (what I call "valuenames") can be accessed as attributes. But to avoid the
    cost of a recarray, I use an ndarray with records.
    This has the unfortunate effect that "valuenames" cannot be accessed as
    attributes on a single record, but only on a real 'Data' object.
    I tried to add some methods to numpy.void from my python code to access
    record fields as attributes, but of course that's not possible.

  * I'd like to associate extra information to dtype, instead of manually
    carrying it around on every operation accessing a record field. Namely:
     * a description; such that it can be automatically used as axis/legend
       labels in matplotlib.
     * unit information; such that units of results can be automatically
       computed when operating with numpy, and later extracted when plotted with
       matplotlib.
       For this, existing packages like 'units' in PyPI could be used.

  * The ability to operate on whole records instead of separate record fields,
    such that I can write:
        b = a[0] + a[1]
    instead of:
        b_f1 = a[0]["f1"] + a[1]["f1"]
        b_f2 = a[0]["f2"] + a[1]["f2"]
    whenever possible.
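
To make the last point concrete, the per-field boilerplate can be hidden behind
a small helper. This is just a sketch of a workaround, assuming every field
supports '+'; the helper name 'add_records' is hypothetical and is not part of
numpy.

```python
import numpy as np

# A small structured array with two float fields, as in the example above.
a = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=[("f1", "f8"), ("f2", "f8")])


def add_records(x, y):
    """Add two records field by field and return a record of the same dtype."""
    out = np.empty((), dtype=x.dtype)  # 0-d structured array holding the result
    for name in x.dtype.names:
        out[name] = x[name] + y[name]
    return out


# One call stands in for the separate b_f1 / b_f2 computations.
b = add_records(a[0], a[1])
print(b["f1"], b["f2"])
```

This still loops over the fields in python, so it only saves typing; native
support in numpy would avoid the loop altogether.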


Comments are welcome.

apa!

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
