[ 
https://issues.apache.org/jira/browse/STATISTICS-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801871#comment-16801871
 ] 

Eric Barnhill commented on STATISTICS-7:
----------------------------------------

Hi [~virendrasinghrp]

I work as a data scientist for a Silicon Valley company so I know data science 
is a big thing and I know what languages are used.

Java makes up a large part of commercial infrastructure, and Apache Commons in 
particular is used everywhere. Even in a small project I am working on right 
now the devs use commons-cli, commons-csv, and commons-lang. The applications 
of commons-statistics will be innumerable.

If you want something data science specific, you probably know that nearly all 
data engineering infrastructure is in Java. The ability to run some statistical 
mappings on the side of that infrastructure would, I think, be very valuable. 
If the mappings were implemented functionally, they would scale effortlessly. 
A lot of job offers for data scientists are really mostly engineering, so 
knowing that side of the business is really good.
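
To make the point about functional mappings a bit more concrete, here is a 
minimal sketch using only the JDK's java.util.stream API. The class and method 
names (StreamStatsSketch, mean, variance) are made up for illustration and are 
not part of commons-statistics; it only shows the style of code I have in mind, 
where going parallel is a one-call change.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

// Hypothetical illustration only; not the commons-statistics API.
public class StreamStatsSketch {

    // Mean via the JDK's built-in summary collector.
    // Note: getAverage() returns 0.0 for an empty input.
    static double mean(double[] values) {
        DoubleSummaryStatistics stats =
                DoubleStream.of(values).parallel().summaryStatistics();
        return stats.getAverage();
    }

    // Population variance as a second pass: map each value to its
    // squared deviation from the mean, then average the results.
    static double variance(double[] values) {
        double m = mean(values);
        return DoubleStream.of(values)
                .parallel()
                .map(x -> (x - m) * (x - m))
                .average()
                .orElse(Double.NaN); // empty input has no variance
    }

    public static void main(String[] args) {
        double[] data = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.println("mean = " + mean(data));
        System.out.println("variance = " + variance(data));
    }
}
```

Whether parallel() actually helps depends on the data size and the source of 
the stream, so treat this as a sketch of the programming style rather than a 
performance claim.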

But I am not going to spend a lot of time defending the project; I think it 
obviously has a very wide audience.

I also disagree that it is easy. A lot goes into the development, testing, 
documentation and release of widely used software. Also, as Gilles said, there 
are a lot of architectural decisions to be made about how the stats libraries 
are going to hang together. It is decisions at this level that separate good 
engineers from bad. We do not want a bunch of independent scripts; I could 
write those very quickly.

If we make good progress, there are lots of ways to extend the project into ML 
tools like logistic regression (which is probably what people use 90%+ of the 
time, from what I hear).

 

> Stream-based Java statistical processing
> ----------------------------------------
>
>                 Key: STATISTICS-7
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-7
>             Project: Apache Commons Statistics
>          Issue Type: New Feature
>            Reporter: Eric Barnhill
>            Priority: Major
>              Labels: GSoC2019, gsoc2019, statistics, streams
>
> The new component aims to be a library of commons statistics functions 
> synchronized with the latest developments in the Java language, in particular 
> Java's functional programming syntax.
> The library will make commonly used statistical functions available to an end 
> user through a simple grammar comparable to commons-math-statistics or 
> scikit-learn, while under the hood it will rely on Java's mapping, streaming, 
> and other producer and consumer functions to ensure the statistical methods 
> run optimally in new Java implementations.
> Developers working on the project will have the opportunity to demonstrate 
> Java programming, functional programming, algorithm design, and data science 
> skills and receive authorship on a commons project that is likely to be 
> widely used.
> The ideal contributor will also be able to help with important architectural 
> decision making. The old source of these libraries, commons-math, grew too 
> large, hierarchically complex and interdependent for the commons mission. The 
> developers on this project need to make architectural choices that will 
> enable the statistical code to be lightweight and reusable, with a minimum of 
> outside dependencies while avoiding redundancy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
