On 9/22/07, Bradford Cross <[EMAIL PROTECTED]> wrote:
> Greetings!
>
> Recently I stumbled into the Commons math project; nice design, good
> abstractions, "smart updates" and even unit tests! :-)
>

Thanks!

> the Smart updates are a key feature for event stream processing / time
> series simulation. The only piece that is missing from a time series
> analysis and simulation perspective is the ability to supply a lag that
> defines a fixed sample size and perform rolling calculations.
>

That functionality actually already exists in the DescriptiveStatistics
class. You can set a "window size" for rolling computations of
univariate statistics using the concrete implementation of this class,
o.a.c.math.stat.descriptive.DescriptiveStatisticsImpl. See
http://commons.apache.org/math/userguide/stat.html

> I was very happy to see this as an item on the wish list.

The wishlist item is not as clear as it could be. Sorry about that. In
addition to the computations in DescriptiveStatistics that require you
to keep all of the values in the current window in memory, we also
support "storeless" computation of statistics that can be computed in
one pass through the data. This allows very large data streams to be
handled with fixed storage overhead. I think that what the wishlist
item refers to is something in between: ways to support the window
concept without storing all of the data. Strictly speaking, this is
impossible, but doing things like sampling from the streams,
periodically resetting, or maintaining arrays of storeless stats with
different offsets would in theory be possible.

>
> A ThoughtWorks colleague (Yaxin Wang) and I are prototyping a java time
> series simulation engine and we are considering the commons math as the base
> of our numerical libraries. In order to do this we need to complete the
> rolling calculations, so here is our first spike (spike means prototype that
> can be thrown away / not a real patch.) We thought we would start with an
> easy case; mean, which uses sum.
>
> We have already combined the rolling calculations with the smart update
> algorithms before in the numerical libraries for our previous time series
> simulation engine. As you have mentioned in the wish list notes, our past
> experience is that some of the algorithms can not avoid using queues for
> rolling updates case. Obviously it is something pretty fundamental to the
> design and requires a bit of work across a lot of places to do this for all
> the statistics (at least starting with summary statistics.)
>
> Please give feedback on the design, any issues with performance (better data
> structure than the queue we used), etc!
>
> If the community is OK with this initial spike, then we can start submitting
> patches. :-)
>

Thanks for the contribution! There are a few problems with
incorporating the code as is, though. First, it uses generics and the
concurrent package, which require JDK 1.5, and our current minimum JDK
level is 1.3. That could probably be eliminated fairly easily, though.
The second is really whether or not the queue implementation is going
to improve performance over the ResizableDoubleArray store that
DescriptiveStatisticsImpl uses now. If you think so and can demonstrate
it with benchmarks, we can talk about swapping out that implementation.
Otherwise, it's probably better to use ResizableDoubleArray.

I am +1 on adding a RollingStatistic abstract base class (I would
prefer that name to "Statistic" since it is specialized) like you have
defined, along with rolling versions of the individual statistics. This
would be a convenience over the current setup and provide a more
intuitive way to access rolling stats than using
DescriptiveStatisticsImpl as a container, which is currently the only
way to do it. So if you can refactor to either use ResizableDoubleArray
as the backing store (look at DescriptiveStatisticsImpl.apply - the
convenience classes could just use that pattern) or otherwise eliminate
the JDK 1.5 dependency, I would support adding the rolling stats.
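
To make the shape of that refactoring concrete, here is a rough sketch
of a rolling mean that avoids generics and the concurrent package. It
is only an illustration under assumptions: the method names
(increment, getResult, getN) and the RollingMean class are hypothetical
and not existing Commons Math API, and a plain double[] circular buffer
stands in for ResizableDoubleArray so that the example has no
dependencies.

```java
// Hypothetical sketch: none of these classes exist in Commons Math.

/** Base class for statistics computed over the most recent values only. */
abstract class RollingStatistic {
    // Circular buffer; a plain array keeps the code free of JDK 1.5 features.
    protected final double[] window;
    protected int count = 0; // number of values currently held in the window
    protected int next = 0;  // slot that the next value will be written to

    protected RollingStatistic(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        window = new double[windowSize];
    }

    /** Adds a value, evicting the oldest one once the window is full. */
    public void increment(double value) {
        if (count == window.length) {
            evicted(window[next]); // the slot being overwritten holds the oldest value
        } else {
            count++;
        }
        window[next] = value;
        next = (next + 1) % window.length;
        added(value);
    }

    /** Number of values currently in the window. */
    public int getN() {
        return count;
    }

    protected abstract void added(double value);

    protected abstract void evicted(double value);

    public abstract double getResult();
}

/** Rolling mean kept up to date with a running sum. */
class RollingMean extends RollingStatistic {
    private double sum = 0.0;

    RollingMean(int windowSize) {
        super(windowSize);
    }

    protected void added(double value) {
        sum += value;
    }

    protected void evicted(double value) {
        sum -= value;
    }

    /** Mean of the values in the window, or NaN if the window is empty. */
    public double getResult() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

With a window of 3, feeding in 1, 2, 3 gives a mean of 2, and adding 4
evicts the 1 so the mean becomes 3. Note that the running sum
accumulates rounding error as values are added and subtracted;
recomputing from the stored window on each call is the slower but
safer alternative.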

If I understand correctly what you intend with Sum and Mean (using
constructor arguments to determine whether or not the statistic is
rolling), I would prefer to leave the existing statistics in
commons-math as is and introduce Rolling versions as separate classes.

One more thing. It is very important that any contributions that you
make can be made in accordance with the Apache Contributor's License
Agreement. Have a look here: http://www.apache.org/licenses/#clas and
make sure you can agree to those terms. Then you can start submitting
patches as attachments to Jira tickets.

Thanks!

Phil