[
https://issues.apache.org/jira/browse/STATISTICS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707525#comment-17707525
]
Anirudh Joshi commented on STATISTICS-54:
-----------------------------------------
Hello [~aherbert] and [~erans]. Hope you are doing well. My name is Anirudh and
I am interested in contributing to this project as part of GSoC 2023. I am
working on my proposal but would like to discuss my ideas with the community
before I finalize my idea to see if I am thinking in the right direction.
I have been familiarizing myself with the commons-stat/stat/descriptive project
over the past few days. I saw that the current implementation of
SummaryStatistics works only with a sequential stream of values, since the
combiner parameter of Stream::collect is never invoked in that case. Our goal
is to add support for parallel streams too, since that would help us reduce
the processing time needed to compute summary statistics, especially when the
dataset is large.
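A minimal JDK-only sketch of this behaviour is given here. It is only an illustration and does not use any Commons Math code; the class name `CombinerDemo` and its `sum` helper are hypothetical. It counts how often the combiner passed to `collect` runs for a sequential and for a parallel stream.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.DoubleStream;
import java.util.stream.IntStream;

// Hypothetical demo class, not part of Commons Math or Commons Statistics.
class CombinerDemo {

    /**
     * Sums the values 1..n with a mutable reduction and records in
     * combinerCalls how many times the combiner was invoked.
     */
    static double sum(int n, boolean parallel, AtomicInteger combinerCalls) {
        DoubleStream values = IntStream.rangeClosed(1, n).asDoubleStream();
        if (parallel) {
            values = values.parallel();
        }
        double[] total = values.collect(
            () -> new double[1],              // supplier: a fresh partial result
            (acc, x) -> acc[0] += x,          // accumulator: add one value
            (a, b) -> {                       // combiner: merge two partial results
                combinerCalls.incrementAndGet();
                a[0] += b[0];
            });
        return total[0];
    }

    public static void main(String[] args) {
        AtomicInteger sequentialCalls = new AtomicInteger();
        AtomicInteger parallelCalls = new AtomicInteger();
        double s1 = sum(1000, false, sequentialCalls);
        double s2 = sum(1000, true, parallelCalls);
        System.out.println("sequential sum=" + s1 + ", combiner calls=" + sequentialCalls.get());
        System.out.println("parallel sum=" + s2 + ", combiner calls=" + parallelCalls.get());
    }
}
```

In the sequential case the combiner count stays at zero, so a class without a correct merge step goes unnoticed until the stream is made parallel.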
An important ingredient that we need to support parallel streams is the `merge`
functionality. We need the ability to merge two partially constructed
`StorelessUnivariateStatistic` objects. Once we implement this for all
implementing classes of StorelessUnivariateStatistic, we would be able to
compute partial SummaryStatistics objects and use our merge function to
aggregate them into a single SummaryStatistics object that gives the
statistics for the entire dataset.
My idea is to define a generic interface as follows:
{code:java}
public interface StatisticAccumulator<T extends StorelessUnivariateStatistic> {
    // Add a single value to the accumulator
    void add(double d);

    // Merge another accumulator into this one. The bound ensures that the
    // parameter is an accumulator impl of the same statistic type T
    <U extends StatisticAccumulator<T>> void merge(U other);

    // Merge a partially constructed StorelessUnivariateStatistic object
    void merge(T other);

    // Get the statistic we are trying to accumulate
    T get();
} {code}
We would then have implementations for the various statistics, such as
MeanAccumulator, GeometricMeanAccumulator, VarianceAccumulator, etc.
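To make the merge step concrete, here is a minimal, self-contained sketch of how a mean accumulator could combine two partial results. It is an illustration under stated assumptions, not the Commons Math implementation: the class name `SimpleMeanAccumulator` and the `meanOf` helper are hypothetical, it does not implement the interface above, and `get()` returns a plain `double` (NaN when empty, an assumed convention) so that it runs without any library on the classpath.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a MeanAccumulator; JDK only.
class SimpleMeanAccumulator {
    private long n;       // number of values seen so far
    private double mean;  // running mean of those values

    // Add a single value using the incremental mean update.
    void add(double x) {
        n++;
        mean += (x - mean) / n;
    }

    // Merge another partial result: a weighted update of the two means.
    void merge(SimpleMeanAccumulator other) {
        if (other.n == 0) {
            return; // nothing to merge
        }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }

    // Assumption of this sketch: an empty accumulator reports NaN.
    double get() {
        return n == 0 ? Double.NaN : mean;
    }

    // Helper showing the accumulator used as a mutable reduction.
    static double meanOf(List<Double> data) {
        return data.parallelStream()
                   .collect(SimpleMeanAccumulator::new,
                            SimpleMeanAccumulator::add,
                            SimpleMeanAccumulator::merge)
                   .get();
    }

    public static void main(String[] args) {
        System.out.println(meanOf(Arrays.asList(1.0, 2.0, 3.0, 4.0, -1.0)));
    }
}
```

Because `merge` only needs the count and the running mean of each side, no input values have to be stored, which matches the storeless design of the existing statistics.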
A sample usage (assuming we have an implementation for MeanAccumulator) would
look like this:
{code:java}
List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 4.0, -1.0);
Mean mean = data.parallelStream()
                .collect(MeanAccumulator::new, MeanAccumulator::add,
                         MeanAccumulator::merge)
                .get(); {code}
I have a [proof of concept
PR|https://github.com/apache/commons-math/compare/master...ani5rudh:commons-math:STATISTICS-54-Proof-Of-Concept]
for my approach, with an implementation of MeanAccumulator.
I am still a student learning the principles of Object Oriented Design and
Modelling, so my approach may not be perfect. I would like to know your
thoughts on my approach so that I can fix and improve my design. Your feedback
is very valuable for my learning and for developing my skills.
I also wanted to know whether the scope of the project, as far as GSoC is
concerned, is to add stream support along with unit tests for all the
subclasses of `AbstractStorelessUnivariateStatistic` (around 17 of them), or
only for a subset of these. I am asking to get clarity on the goals so that I
can plan accordingly to achieve them in the 12 weeks of the GSoC coding period.
Please let me know. Thanks in advance!
> [GSoC] Summary statistics API for Java 8 streams
> ------------------------------------------------
>
> Key: STATISTICS-54
> URL: https://issues.apache.org/jira/browse/STATISTICS-54
> Project: Commons Statistics
> Issue Type: Wish
> Components: descriptive
> Reporter: Alex Herbert
> Priority: Minor
> Labels: full-time, gsoc, gsoc2022, gsoc2023
> Fix For: 1.0
>
>
> Placeholder for tasks that could be undertaken in this year's
> [GSoC|https://summerofcode.withgoogle.com/].
> Ideas:
> - Design an updated summary statistics API for use with Java 8 streams based
> on the summary statistic implementations in the Commons Math
> {{stat.descriptive}} package including {{moments}}, {{rank}} and
> {{summary}} sub-packages.