[ 
https://issues.apache.org/jira/browse/STATISTICS-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802218#comment-16802218
 ] 

Eric Barnhill commented on STATISTICS-7:
----------------------------------------

Hi [~Salman], sounds like you would be in a great position to contribute the 
streaming code that is needed.

Just to be clear everyone, there is plenty of work here and [~erans] and I can 
(I assume) split this ticket so that we can assign a project of interest to 
whomever is interested. Right off the top of my head, I can think of a few 
different work packages here:

First, there are the summary statistics in stat.descriptive.*. That is already 
quite a lot, but the work is not restricted to them in any way. Some of these 
stats are pretty obscure for a "common" library (who uses FourthMoment, for 
example?). Meanwhile, many of the metrics found in, for example, 
[https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics] 
are very commonly used in 2019, and their implementation with a clear user 
interface in Java is long overdue. If you write a proposal for such a project, 
you will be very free to take the initiative and contribute what you want here.
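
To make the streaming idea concrete, here is a minimal sketch of a summary 
statistic that plugs into java.util.stream. The class name RunningVariance and 
its methods are hypothetical and are not part of any existing Commons API. The 
sketch assumes Welford's online update for single values and the standard 
pairwise merge formula for combining partial results from parallel streams.

```java
import java.util.stream.DoubleStream;

/** Hypothetical accumulator, for illustration only; not a Commons class. */
final class RunningVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Welford's update for a single value. */
    void accept(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merges another partial result; required by parallel streams. */
    void combine(RunningVariance other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    long getN() {
        return n;
    }

    double getMean() {
        return mean;
    }

    /** Unbiased sample variance; NaN when fewer than two values were seen. */
    double getVariance() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    /** Reduces a stream of doubles into a single accumulator. */
    static RunningVariance of(DoubleStream values) {
        return values.collect(RunningVariance::new,
                              RunningVariance::accept,
                              RunningVariance::combine);
    }
}
```

A caller would write something like 
RunningVariance.of(DoubleStream.of(data)).getVariance(). Because the merge step 
is explicit, the same accumulator should also work with a parallel stream.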

There is also a whole regression library, and porting that might be of 
particular interest to anyone interested in machine learning. It is also a 
suboptimally designed library, IMO; I think the "SimpleRegression" class is 
evidence of bad design. Regression is such a huge percentage of actual ML in 
the wild that knowing it very technically, and being able to code it up, is I 
think a real asset that would put you ahead of 99% of aspiring data scientists.
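
As one possible illustration of regression in a stream-based style, the sketch 
below fits a straight line by ordinary least squares while consuming (x, y) 
pairs from a Stream. The class LineFit and all of its method names are 
hypothetical; this is not how the existing SimpleRegression class is written, 
nor a proposed design. It assumes each point arrives as a two-element double 
array and uses plain running sums, which are simple but can lose precision on 
badly scaled data.

```java
import java.util.stream.Stream;

/** Hypothetical least-squares line fit, for illustration only. */
final class LineFit {
    private long n;
    private double sumX;
    private double sumY;
    private double sumXX;
    private double sumXY;

    /** Adds one point given as {x, y}. */
    void accept(double[] xy) {
        double x = xy[0];
        double y = xy[1];
        n++;
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    /** Merges another partial result; required by parallel streams. */
    void combine(LineFit other) {
        n += other.n;
        sumX += other.sumX;
        sumY += other.sumY;
        sumXX += other.sumXX;
        sumXY += other.sumXY;
    }

    /** Slope of the fitted line; NaN if all x values are identical. */
    double slope() {
        double denominator = n * sumXX - sumX * sumX;
        return denominator == 0 ? Double.NaN
                                : (n * sumXY - sumX * sumY) / denominator;
    }

    /** Intercept of the fitted line; NaN if the slope is undefined. */
    double intercept() {
        return (sumY - slope() * sumX) / n;
    }

    /** Reduces a stream of {x, y} points into a single fit. */
    static LineFit of(Stream<double[]> points) {
        return points.collect(LineFit::new, LineFit::accept, LineFit::combine);
    }
}
```

A real implementation would probably prefer centered (mean-updating) sums for 
numerical stability and would expose more than the two coefficients.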

Finally, there is a whole library of statistical tests, as well as correlation 
and covariance libraries. This would be another work package, perhaps for 
someone more interested in statistics who wants to get into applied math and 
code those algorithms.

> Stream-based Java statistical processing
> ----------------------------------------
>
>                 Key: STATISTICS-7
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-7
>             Project: Apache Commons Statistics
>          Issue Type: New Feature
>            Reporter: Eric Barnhill
>            Priority: Major
>              Labels: GSoC2019, gsoc2019, statistics, streams
>
> The new component aims to be a library of commons statistics functions 
> synchronized with the latest developments in the Java language, in particular 
> Java's functional programming syntax.
> The library will make commonly used statistical functions available to an end 
> user through a simple grammar comparable to commons-math-statistics or 
> scikit-learn, while under the hood will implement Java's mapping, streaming, 
> and other producer and consumer functions to ensure the statistical methods 
> run optimally in new Java implementations.
> Developers working on the project will have the opportunity to demonstrate 
> Java programming, functional programming, algorithm design, and data science 
> skills and receive authorship on a commons project that is likely to be 
> widely used.
> The ideal contributor will also be able to help with important architectural 
> decision making. The old source of these libraries, commons-math, grew too 
> large, hierarchically complex and interdependent for the commons mission. The 
> developers on this project need to make architectural choices that will 
> enable the statistical code to be lightweight and reusable, with a minimum of 
> outside dependencies while avoiding redundancy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
