[
https://issues.apache.org/jira/browse/STATISTICS-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801871#comment-16801871
]
Eric Barnhill commented on STATISTICS-7:
----------------------------------------
Hi [~virendrasinghrp]
I work as a data scientist for a Silicon Valley company so I know data science
is a big thing and I know what languages are used.
Java makes up a large part of commercial infrastructure, and Apache Commons in
particular is used everywhere. Even in a small project I am working on right
now, the devs use commons-cli, commons-csv, and commons-lang. The applications
of commons-statistics will be innumerable.
If you want something specific to data science, you probably know that nearly
all data engineering infrastructure is in Java. The ability to run some
statistical mappings alongside that infrastructure would, I think, be very
valuable. If the mappings were implemented functionally, they would scale
effortlessly. A lot of job offers for data scientists are really mostly
engineering, so knowing that side of the business is really good.
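To make the "implemented functionally" point concrete, here is a minimal
sketch of what I have in mind. It is not code from commons-statistics; the
class name StreamMoments and its methods are made up for illustration. It
computes the mean and sample variance as a mutable reduction over a
DoubleStream, using Welford's update for single values and the standard
pairwise merge for partial results, so the same code runs sequentially or in
parallel.

```java
import java.util.stream.DoubleStream;

/** Hypothetical example class; not part of commons-statistics. */
final class StreamMoments {
    private long n;       // number of values seen
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Fold one value into the running state (Welford's update). */
    void accept(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merge a partial result computed on another chunk into this one. */
    void combine(StreamMoments other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    long count() { return n; }
    double mean() { return mean; }

    /** Sample variance (n - 1 denominator); NaN for fewer than two values. */
    double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    /** Reduce a stream to its moments; works for parallel streams too. */
    static StreamMoments of(DoubleStream values) {
        return values.collect(StreamMoments::new,
                              StreamMoments::accept,
                              StreamMoments::combine);
    }

    public static void main(String[] args) {
        StreamMoments s = StreamMoments.of(
            DoubleStream.of(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).parallel());
        System.out.println(s.count() + " values, mean " + s.mean()
            + ", variance " + s.variance());
    }
}
```

Because combine() makes the accumulator mergeable, the stream framework is
free to split the input across threads and join the partial results, which is
the kind of scaling I mean.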
But I am not going to spend a lot of time defending the project; I think it
obviously has a very wide audience.
I also disagree that it is easy. There is a lot that goes into the development,
testing, documentation, and release of widely used software. Also, as Gilles
said, there are a lot of architectural decisions to be made about how the stats
libraries are going to hang together. It is decisions at this level that
separate good engineers from bad. We do not want a bunch of independent
scripts; I could write those very quickly.
If we make good progress there are lots of ways to extend the project into ML
tools like logistic regression (which is probably what people use 90%+ of the
time, from what I hear).
> Stream-based Java statistical processing
> ----------------------------------------
>
> Key: STATISTICS-7
> URL: https://issues.apache.org/jira/browse/STATISTICS-7
> Project: Apache Commons Statistics
> Issue Type: New Feature
> Reporter: Eric Barnhill
> Priority: Major
> Labels: GSoC2019, gsoc2019, statistics, streams
>
> The new component aims to be a library of commons statistics functions
> synchronized with the latest developments in the Java language, in particular
> Java's functional programming syntax.
> The library will make commonly used statistical functions available to an end
> user through a simple grammar comparable to commons-math-statistics or
> scikit-learn, while under the hood it will use Java's mapping, streaming,
> and other producer and consumer functions to ensure the statistical methods
> run optimally in new Java implementations.
> Developers working on the project will have the opportunity to demonstrate
> Java programming, functional programming, algorithm design, and data science
> skills and receive authorship on a commons project that is likely to be
> widely used.
> The ideal contributor will also be able to help with important architectural
> decision making. The old source of these libraries, commons-math, grew too
> large, hierarchically complex and interdependent for the commons mission. The
> developers on this project need to make architectural choices that will
> enable the statistical code to be lightweight and reusable, with a minimum of
> outside dependencies while avoiding redundancy.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)