[jira] [Commented] (STATISTICS-11) OVERALL-TASK (not yet split): Designing Robust Class Structure and Architecture

Alex D Herbert (JIRA) Thu, 23 May 2019 13:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847027#comment-16847027
 ]


Alex D Herbert commented on STATISTICS-11:
------------------------------------------

I would make the output of the RegressionDataLoader an interface 
RegressionData. That should be passed to the downstream classes. This interface 
will only require the methods:
{code:java}
StatisticsMatrix getXData();

StatisticsMatrix getYData();

boolean getHasIntercept();
{code}
Once the data is loaded, the downstream classes do not need access to the 
methods in the RegressionDataLoader. They only require the final data.

I associate *{{Loader}}* with some sort of IO operation. You could start with 
an in-memory loader (RegressionDataBuilder) and add more loaders for specific 
use cases later, e.g. file IO, or different formats such as 2D arrays and 1D 
arrays of packed 2D data.
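
A minimal sketch of how the in-memory builder could hand a RegressionData to downstream code. Everything beyond the three interface methods is an assumption made for illustration: the StatisticsMatrix here is a plain array-backed stand-in rather than the EJML-based class from the proposal, the builder method names (xData, yData, hasIntercept, build) are hypothetical, and the intercept flag is assumed to be a boolean.

```java
/** Hypothetical stand-in for the EJML-backed StatisticsMatrix from the proposal. */
final class StatisticsMatrix {
    private final double[][] values;

    StatisticsMatrix(double[][] source) {
        // Copy row by row so later changes to the caller's array are not visible here.
        values = new double[source.length][];
        for (int i = 0; i < source.length; i++) {
            values[i] = source[i].clone();
        }
    }

    int getNumRows() { return values.length; }

    int getNumCols() { return values.length == 0 ? 0 : values[0].length; }

    double get(int row, int col) { return values[row][col]; }
}

/** The only type the downstream regression classes need to see. */
interface RegressionData {
    StatisticsMatrix getXData();

    StatisticsMatrix getYData();

    boolean getHasIntercept(); // assumed to be a boolean flag
}

/** In-memory "loader": collects raw arrays and converts them to matrices once. */
final class RegressionDataBuilder {
    private double[][] x;
    private double[] y;
    private boolean hasIntercept = true; // assumed default

    RegressionDataBuilder xData(double[][] values) { this.x = values; return this; }

    RegressionDataBuilder yData(double[] values) { this.y = values; return this; }

    RegressionDataBuilder hasIntercept(boolean value) { this.hasIntercept = value; return this; }

    RegressionData build() {
        if (x == null || y == null) {
            throw new IllegalStateException("Both x and y data must be set");
        }
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        // Store y as an n-by-1 column so both accessors return the same matrix type.
        double[][] yColumn = new double[y.length][1];
        for (int i = 0; i < y.length; i++) {
            yColumn[i][0] = y[i];
        }
        final StatisticsMatrix xMatrix = new StatisticsMatrix(x);
        final StatisticsMatrix yMatrix = new StatisticsMatrix(yColumn);
        final boolean intercept = hasIntercept;
        // The anonymous class exposes only the final data, none of the builder methods.
        return new RegressionData() {
            @Override public StatisticsMatrix getXData() { return xMatrix; }
            @Override public StatisticsMatrix getYData() { return yMatrix; }
            @Override public boolean getHasIntercept() { return intercept; }
        };
    }
}

final class RegressionDataDemo {
    public static void main(String[] args) {
        RegressionData data = new RegressionDataBuilder()
            .xData(new double[][] {{1.0, 2.0}, {3.0, 4.0}})
            .yData(new double[] {5.0, 6.0})
            .build();
        System.out.println(data.getXData().getNumRows() + " observations loaded");
    }
}
```

A file-based loader added later would only need to return the same RegressionData type, so the downstream classes would not change.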

 

> OVERALL-TASK (not yet split): Designing Robust Class Structure and 
> Architecture
> -------------------------------------------------------------------------------
>
>                 Key: STATISTICS-11
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-11
>             Project: Apache Commons Statistics
>          Issue Type: Sub-task
>            Reporter: Ben Nguyen
>            Priority: Major
>         Attachments: Current-Implementation.png, Proposed Detailed UML.png, 
> Proposed-Implementation.png
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> +*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+
> Hello,
> I have some broad general ideas about how the regression module should be 
> structured, as outlined briefly in my proposal with UML diagrams.
> This is the current implementation inside commons-math-stat-regression:
>  
> [!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]
>  
>  *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}It seems there is/was an image here but I don't see it.{color}
> {color:#FF0000}For this kind of information, please use JIRA (and provide the 
> link here).{color}
> This is my proposed idea, where the structure was partly inspired by SuanShu 
> since it supported multiple types of regression (including logistic):
> [https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]
>  
> Disclaimer: I have only studied some econometrics and second year computer 
> science in university, so I have zero professional data engineering 
> experience, but am excited to start learning with this project. So, I don’t 
> currently know the exact needs of data engineers with regard to this module 
> and am learning as I go, which is why I would very much appreciate any input 
> on the kinds of requirements data engineers would want from this regression 
> module.
>  *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Basing a design on use-cases is very useful.{color}
> {color:#FF0000}You should collect a range of them (small/large datasets, 
> in-memory/stream,{color}
> {color:#FF0000}dense/sparse) in order to figure what parts of the code can be 
> common and{color}
> {color:#FF0000}what requires specialization.{color}
> From someone who has used the current implementation or will use this new 
> implementation:
>  * What would make your life easier?
>  * What should definitely be kept?
>  * What should be added/improved?
>  * Any specific features or design criteria?
>  * Any changes or radically different approaches to the following idea?
>  *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Good questions!{color}
> {color:#FF0000}What are your answers? ;-){color}
> Note: OLS, GLS and Logistic regression are the first to be implemented, with 
> a focus on building architectural support for further additions. Changes will 
> make use of new Java 8 features, specifically the Java Streams API, to improve 
> performance and readability.
>  
>  *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}+1{color}
> {color:#FF0000}I'd suggest to select one and start coding, without fearing 
> that you'll{color}
> {color:#FF0000}probably have to change a lot of it as more use-cases are 
> collected.{color}
> [!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]
> *+Updates to this proposed implementation UML in my proposal:+*
>  * “statistics-regression-reqLinearMath” will be replaced with EJML as 
> suggested by Mr. Eric Barnhill
>  * This will include a custom matrix class extended from EJML’s SimpleBase -> 
> StatisticsMatrix
>  * So if we decide to use an Apache Commons implementation of matrices later 
> on, only this class should be changed internally.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Good precaution; but I doubt that we can include everything in 
> a{color}
> {color:#FF0000}single class.{color}
> {color:#FF0000}How to best encapsulate the linear algebra (external) library 
> is a{color}
> {color:#FF0000}subject on its own, worth its own thread:  Cramming many 
> questions{color}
> {color:#FF0000}in a single post makes it likely that some will be missed by 
> some{color}
> {color:#FF0000}people who might later on question the chosen path.  
> [External{color}
> {color:#FF0000}dependencies are a sensitive issue in Commons...]{color}
> {color:#FF0000} {color}
> {color:#FF0000}Also, a reminder that we need to take into account the 
> comparative{color}
> {color:#FF0000}benchmarks which I posted recently.  [Even if just to conclude 
> that{color}
> {color:#FF0000}EJML has overwhelming advantages (which?) that make it 
> more{color}
> {color:#FF0000}suitable than its "competitors".]{color}
>  * Abstract classes should have interfaces above them or perhaps just be 
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
> *+Notes about this proposed implementation:+*
>  * AbstractVariables and its child classes may not be necessary, i.e. just 
> Estimators and Residuals classes
>  * Or perhaps it’s best to follow the current implementation’s example and 
> have a single class per regression type for hierarchy simplicity (but risking 
> redundancies)?
>  * I have not looked into specific data members or individual methods yet. So 
> far just taking notes from the current implementation and SuanShu
>  * The “statistics-regression-updating” components have quite complex 
> algorithms which will require a lot of time for me to understand completely
>  * So for now, I see myself making minimal changes to them, prioritizing the 
> new “stored” components.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}IMHO, this will be better discussed once an initial 
> implementation is shown{color}
> {color:#FF0000}(or perhaps, as Eric suggested, with unit tests).{color}
> {color:#FF0000}Again, better to start a new thread for each specific 
> question, possibly backed{color}
> {color:#FF0000}with a new JIRA report focussed on a particular task (see 
> "Create sub-tasks"{color}
> {color:#FF0000}on JIRA).{color}
>  * RegressionDataLoader’s purpose is to:
>  * provide a clean input interface
>  * and to ensure that data from say double[ ][ ] is only converted to working 
> form as a StatisticsMatrix object once
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Until proven wrong, I'm a proponent of separating I/O from 
> "useful"{color}
> {color:#FF0000}computations.{color}
> {color:#FF0000}I.e. I suggest that we consider on the one hand what API is 
> required for all the{color}
> {color:#FF0000}intended functionalities, and on the other (in a *different* 
> "maven{color}
> {color:#FF0000}module"), all the{color}
> {color:#FF0000}conversions that may be implemented for the convenience of 
> users.{color}
>  * while allowing multiple types of regression to be calculated via a 
> universal form….
>  * which could become a challenge once details are in order.
>  
> This is the current state of my plan. With your input, I will move on to the 
> next steps, plan more details, and start creating the software flowchart.
>  
> Thank you in advance for any advice/suggestions,
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}To summarize, my main suggestion is to split this post in 
> more{color}
> {color:#FF0000}manageable chunks.{color}
> {color:#FF0000}Regards,{color}
> {color:#FF0000}Gilles{color}
> -Ben Nguyen



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
