[
https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847027#comment-16847027
]
Alex D Herbert commented on STATISTICS-11:
------------------------------------------
I would make the output of the RegressionDataLoader an interface
RegressionData. That should be passed to the downstream classes. This interface
will only require the methods:
{code:java}
StatisticsMatrix getXData();
StatisticsMatrix getYData();
boolean getHasIntercept();
{code}
Once the data is loaded, the downstream classes do not need access to the
methods in the RegressionDataLoader; they only require the final data.
I associate *{{Loader}}* with some sort of IO operation. You could start with
an in-memory loader (RegressionDataBuilder) and add more loaders for specific
use cases later, e.g. file IO, or different formats such as 2D arrays and 1D
arrays of packed 2D data.
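A minimal sketch of how this split might look is given here. Everything in it
is an assumption made for illustration: the StatisticsMatrix class is a plain
array-backed stand-in (the proposed one would wrap EJML), and the builder
method names (x, y, packedX, hasIntercept, build) are hypothetical. It only
shows that downstream code can depend on RegressionData alone, and that the
raw arrays are converted to matrices once, in build().

```java
import java.util.Arrays;

/** Hypothetical stand-in for the proposed StatisticsMatrix (assumption: array-backed). */
final class StatisticsMatrix {
    private final double[][] values;

    StatisticsMatrix(double[][] source) {
        // Deep copy so the matrix does not change if the caller edits its array later.
        values = new double[source.length][];
        for (int i = 0; i < source.length; i++) {
            values[i] = source[i].clone();
        }
    }

    int rows() { return values.length; }
    int cols() { return values.length == 0 ? 0 : values[0].length; }
    double get(int row, int col) { return values[row][col]; }
}

/** Read-only result of loading: the only type downstream classes need to see. */
interface RegressionData {
    StatisticsMatrix getXData();
    StatisticsMatrix getYData();
    boolean getHasIntercept();
}

/** In-memory builder; a file-based loader could return the same interface. */
final class RegressionDataBuilder {
    private double[][] x;
    private double[] y;
    private boolean hasIntercept = true;

    RegressionDataBuilder x(double[][] xData) { this.x = xData; return this; }
    RegressionDataBuilder y(double[] yData) { this.y = yData; return this; }
    RegressionDataBuilder hasIntercept(boolean flag) { this.hasIntercept = flag; return this; }

    /** Accepts a 1D array of packed 2D data (assumed row-major) with the given column count. */
    RegressionDataBuilder packedX(double[] packed, int cols) {
        if (cols <= 0 || packed.length % cols != 0) {
            throw new IllegalArgumentException("packed length must be a multiple of cols");
        }
        double[][] rows = new double[packed.length / cols][];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = Arrays.copyOfRange(packed, i * cols, (i + 1) * cols);
        }
        this.x = rows;
        return this;
    }

    RegressionData build() {
        if (x == null || y == null) {
            throw new IllegalStateException("x and y must both be set");
        }
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        // The conversion to the working matrix form happens once, here.
        final StatisticsMatrix xMatrix = new StatisticsMatrix(x);
        final double[][] yColumn = new double[y.length][1];
        for (int i = 0; i < y.length; i++) {
            yColumn[i][0] = y[i];
        }
        final StatisticsMatrix yMatrix = new StatisticsMatrix(yColumn);
        final boolean intercept = hasIntercept;
        return new RegressionData() {
            @Override public StatisticsMatrix getXData() { return xMatrix; }
            @Override public StatisticsMatrix getYData() { return yMatrix; }
            @Override public boolean getHasIntercept() { return intercept; }
        };
    }
}
```

With this shape, a regression class would take a RegressionData argument and
never see the builder or any later IO-based loader.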
> OVERALL-TASK (not yet split): Designing Robust Class Structure and
> Architecture
> -------------------------------------------------------------------------------
>
> Key: STATISTICS-11
> URL: https://issues.apache.org/jira/browse/STATISTICS-11
> Project: Apache Commons Statistics
> Issue Type: Sub-task
> Reporter: Ben Nguyen
> Priority: Major
> Attachments: Current-Implementation.png, Proposed Detailed UML.png,
> Proposed-Implementation.png
>
> Original Estimate: 840h
> Remaining Estimate: 840h
>
> +*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+
> Hello,
> I have some broad general ideas about how the regression module should be
> structured, as outlined briefly in my proposal with UML diagrams.
> This is the current implementation inside commons-math-stat-regression:
>
> [!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}It seems there is/was an image here but I don't see it.{color}
> {color:#FF0000}For this kind of information, please use JIRA (and provide the
> link here).{color}
> This is my proposed idea, where the structure was partly inspired by SuanShu
> since it supported multiple types of regression (including logistic):
> [https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]
>
> Disclaimer: I have only studied some econometrics and second year computer
> science in university, so I have zero professional data engineering
> experience, but am excited to start learning with this project. So, I don’t
> currently know the exact needs of data engineers with regard to this module
> and am learning as I go, which is why I would very much appreciate any input
> on the kinds of requirements data engineers would want from this regression
> module.
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Basing a design on use-cases is very useful.{color}
> {color:#FF0000}You should collect a range of them (small/large datasets,
> in-memory/stream,{color}
> {color:#FF0000}dense/sparse) in order to figure what parts of the code can be
> common and{color}
> {color:#FF0000}what requires specialization.{color}
> From someone who has used the current implementation or will use this new
> implementation:
> * What would make your life easier?
> * What should definitely be kept?
> * What should be added/improved?
> * Any specific features or design criteria?
> * Any changes or radically different approaches to the following idea?
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Good questions!{color}
> {color:#FF0000}What are your answers? ;-){color}
> Note: OLS, GLS and Logistic regression are the first to be implemented, with
> a focus on providing architectural support for further additions. Changes will make
> use of new Java 8 features, specifically the Java Streams API to improve
> performance and readability.
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}+1{color}
> {color:#FF0000}I'd suggest selecting one and starting to code, without fearing
> that you'll{color}
> {color:#FF0000}probably have to change a lot of it as more use-cases are
> collected.{color}
> [!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]
> *+Updates to this proposed implementation UML in my proposal:+*
> * “statistics-regression-reqLinearMath” will be replaced with EJML as
> suggested by Mr. Eric Barnhill
> * This will include a custom matrix class extended from EJML’s SimpleBase ->
> StatisticsMatrix
> * So if we decide to use an Apache Commons implementation of matrices later
> on, only this class should be changed internally.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Good precaution; but I doubt that we can include everything in
> a{color}
> {color:#FF0000}single class.{color}
> {color:#FF0000}How to best encapsulate the linear algebra (external) library
> is a{color}
> {color:#FF0000}subject on its own, worth its own thread: Cramming many
> questions{color}
> {color:#FF0000}in a single post makes it likely that some will be missed by
> some{color}
> {color:#FF0000}people who might later on question the chosen path.
> [External{color}
> {color:#FF0000}dependencies is a sensitive issue, in Commons...]{color}
> {color:#FF0000} {color}
> {color:#FF0000}Also, a reminder that we need to take into account the
> comparative{color}
> {color:#FF0000}benchmarks which I posted recently. [Even if just to conclude
> that{color}
> {color:#FF0000}EJML has overwhelming advantages (which?) that make it
> more{color}
> {color:#FF0000}suitable than its "competitors".]{color}
> * Abstract classes should have interfaces above them or perhaps just be
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
> *+Notes about this proposed implementation:+*
> * AbstractVariables and its child classes may not be necessary, i.e. just
> Estimators and Residuals classes
> * Or perhaps it’s best to follow the current implementation’s example and
> have a single class per regression type for hierarchy simplicity (but risking
> redundancies)?
> * I have not looked into specific data members or individual methods yet. So
> far just taking notes from the current implementation and SuanShu
> * The “statistics-regression-updating” components have quite complex
> algorithms which will require a lot of time for me to understand completely
> * So for now, I see myself making minimal changes to them, prioritizing the
> new “stored” components.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}IMHO, this will be better discussed once an initial
> implementation is shown{color}
> {color:#FF0000}(or perhaps, as Eric suggested, with unit tests).{color}
> {color:#FF0000}Again, better to start a new thread for each specific
> question, possibly backed{color}
> {color:#FF0000}with a new JIRA report focussed on a particular task (see
> "Create sub-tasks"{color}
> {color:#FF0000}on JIRA).{color}
> * RegressionDataLoader’s purpose is to:
> * provide a clean input interface
> * and to ensure that data from, say, double[][] is converted to its working
> form as a StatisticsMatrix object only once
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Until proven wrong, I'm a proponent of separating I/O from
> "useful"{color}
> {color:#FF0000}computations.{color}
> {color:#FF0000}I.e. I suggest that we consider on the one hand what API is
> required for all the{color}
> {color:#FF0000}intended functionalities, and on the other (in a *different*
> "maven{color}
> {color:#FF0000}module"), all the{color}
> {color:#FF0000}conversions that may be implemented for the convenience of
> users.{color}
> * while allowing multiple types of regression to be calculated via a
> universal form….
> * which could become a challenge once details are in order.
>
> This is the current state of my plan. With your input, I will move to the
> next steps, plan more details, and start creating the software flowchart.
>
> Thank you in advance for any advice/suggestions,
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}To summarize, my main suggestion is to split this post in
> more{color}
> {color:#FF0000}manageable chunks.{color}
> {color:#FF0000}Regards,{color}
> {color:#FF0000}Gilles{color}
> -Ben Nguyen