On 9/22/07, Bradford Cross <[EMAIL PROTECTED]> wrote:
> Greetings!
>
> Recently I stumbled into the Commons math project; nice design, good
> abstractions, "smart updates" and even unit tests! :-)
>
Thanks!

> The smart updates are a key feature for event stream processing / time
> series simulation.  The only piece that is missing from a time series
> analysis and simulation perspective is the ability to supply a lag that
> defines a fixed sample size and perform rolling calculations.
>

That functionality actually already exists in the
DescriptiveStatistics class.  You can set a "window size" for rolling
computations of univariate statistics using the concrete
implementation of this class,
o.a.c.math.stat.descriptive.DescriptiveStatisticsImpl.  See
http://commons.apache.org/math/userguide/stat.html

> I was very happy to see this as an item on the wish list.

The wishlist item is not as clear as it could be.  Sorry about that.
In addition to the computations in DescriptiveStatistics that require
that you maintain all of the values in the current window in memory,
we also support "storeless" computation of statistics that can be
computed in one pass through the data. This allows very large data
streams to be handled with fixed storage overhead.  I think that what
the wishlist item refers to is something in between - ways to support
the window concept without storing all of the data.  Strictly
speaking, this is impossible, but doing things like sampling from the
streams, periodically resetting or maintaining arrays of storeless
stats with different offsets would in theory be possible.
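
To make the last idea concrete, here is a rough sketch of "arrays of
storeless stats with different offsets".  It is only an approximation:
the class name StaggeredMean and its methods are hypothetical and not
part of commons-math.  It keeps k running (count, sum) pairs, resets
one of them every window/k values, and reports the mean from the pair
that has accumulated the most values, so the result covers between
window - window/k + 1 and window of the most recent observations.

```java
/** Hypothetical sketch, not a commons-math class: approximate rolling mean in O(k) storage. */
class StaggeredMean {
    private final long[] counts;   // values seen by each accumulator since its last reset
    private final double[] sums;   // running sum held by each accumulator
    private final int stride;      // number of values between resets (window / k)
    private long seen = 0;         // total number of values added so far

    StaggeredMean(int window, int k) {
        if (k < 1 || window < k || window % k != 0) {
            throw new IllegalArgumentException("window must be a positive multiple of k");
        }
        counts = new long[k];
        sums = new double[k];
        stride = window / k;
    }

    void addValue(double value) {
        // Every `stride` values, reset the accumulator whose turn it is.
        if (seen % stride == 0) {
            int j = (int) ((seen / stride) % counts.length);
            counts[j] = 0;
            sums[j] = 0.0;
        }
        // Every accumulator sees every value; none of the values are stored.
        for (int j = 0; j < counts.length; j++) {
            counts[j]++;
            sums[j] += value;
        }
        seen++;
    }

    /** Mean from the accumulator that currently covers the most values. */
    double getMean() {
        int best = 0;
        for (int j = 1; j < counts.length; j++) {
            if (counts[j] > counts[best]) {
                best = j;
            }
        }
        return counts[best] == 0 ? Double.NaN : sums[best] / counts[best];
    }
}
```

A larger k tracks the nominal window more closely at the cost of more
work per added value; storage stays at k pairs regardless of the
window length.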
>
> A ThoughtWorks colleague (Yaxin Wang) and I are prototyping a java time
> series simulation engine and we are considering the commons math as the base
> of our numerical libraries.  In order to do this we need to complete the
> rolling calculations, so here is our first spike (spike means prototype that
> can be thrown away / not a real patch.)  We thought we would start with an
> easy case; mean, which uses sum.
>
> We have already combined the rolling calculations with the smart update
> algorithms before in the numerical libraries for our previous time series
> simulation engine.  As you have mentioned in the wish list notes, our past
> experience is that some of the algorithms cannot avoid using queues in the
> rolling-update case.  Obviously it is something pretty fundamental to the
> design and requires a bit of work across a lot of places to do this for all
> the statistics (at least starting with summary statistics.)
>
> Please give feedback on the design, any issues with performance (better data
> structure than the queue we used), etc!
>
> If the community is OK with this initial spike, then we can start submitting
> patches. :-)
>

Thanks for the contribution! There are a few problems with
incorporating the code as is, though.  First, it uses generics and the
concurrent package, which require JDK 1.5, and our current minimum JDK
level is 1.3.  That dependency could probably be eliminated fairly
easily, though.  The second is whether the queue implementation
is going to improve performance over the ResizableDoubleArray store
that DescriptiveStatisticsImpl uses now.  If you think so and can
demonstrate it with benchmarks, we can talk about swapping out that
implementation.  Otherwise, it's probably better to use
ResizableDoubleArray.

I am +1 on adding a RollingStatistic abstract base class (would prefer
that name to "Statistic" since it is specialized) like you have
defined and rolling versions of the individual statistics.  This would
be a convenience over the current setup and provide a more intuitive
way to access rolling stats than to use DescriptiveStatisticsImpl as a
container.  Currently this is the only way to do it.  So if you
can refactor to either use ResizableDoubleArray as the backing store
(look at DescriptiveStatisticsImpl.apply - the convenience classes
could just use that pattern) or otherwise eliminate the JDK 1.5
dependency, I would support adding the rolling stats.  If I understand
correctly what you intend with Sum and Mean (using
constructor arguments to determine whether or not the statistic is
rolling), I would prefer to leave the existing statistics in
commons-math as is and introduce Rolling versions as separate classes.
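
As a rough illustration of the shape I have in mind, here is a minimal
sketch of a separate rolling class built on an abstract base.  All of
the names and signatures below are assumptions made for illustration,
not existing commons-math API, and a plain circular double[] stands in
for ResizableDoubleArray so that the example is self-contained.  It
avoids generics and the concurrent package.

```java
/** Hypothetical base class: keeps the most recent windowSize values in a circular array. */
abstract class RollingStatistic {
    private final double[] window; // circular buffer of primitives, no generics needed
    private int next = 0;          // slot that the next value will overwrite
    private int count = 0;         // number of values currently held (at most window.length)

    protected RollingStatistic(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        window = new double[windowSize];
    }

    /** Adds a value, overwriting the oldest one once the window is full. */
    public void addValue(double value) {
        window[next] = value;
        next = (next + 1) % window.length;
        if (count < window.length) {
            count++;
        }
    }

    /** Number of values currently in the window. */
    public int getN() {
        return count;
    }

    /** Evaluates the statistic over the values currently in the window. */
    public double getResult() {
        return evaluate(window, count);
    }

    /** Computes over the first length entries of values; they are not in arrival order. */
    protected abstract double evaluate(double[] values, int length);
}

/** Hypothetical rolling mean, kept separate from any non-rolling Mean class. */
class RollingMean extends RollingStatistic {
    RollingMean(int windowSize) {
        super(windowSize);
    }

    protected double evaluate(double[] values, int length) {
        if (length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (int i = 0; i < length; i++) {
            sum += values[i];
        }
        return sum / length;
    }
}
```

This version recomputes over the stored window on each call, which is
simple but O(window) per result.  Statistics such as the sum could
instead be updated incrementally by subtracting the overwritten value,
at the cost of accumulating rounding error over long streams.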

One more thing.  It is very important that any contributions that you
make can be made in accordance with the Apache Contributor's License
Agreement.  Have a look here:
http://www.apache.org/licenses/#clas
and make sure you can agree to those terms.  Then you can start
submitting patches as attachments to Jira tickets.

Thanks!

Phil
