[jira] [Commented] (SPARK-3878) Benchmarks and common tests for mllib algorithm

Joseph K. Bradley (JIRA) Wed, 25 Mar 2015 14:48:13 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380847#comment-14380847
 ]


Joseph K. Bradley commented on SPARK-3878:
------------------------------------------

It will be great to standardize tests.  I don't think this is specific to the 
spark.ml branch, but developing tests within spark.ml might be good until they 
are ready to replace some of the spark.mllib tests.

I think a big re-work like this requires a design discussion.  Before coding a 
lot, it would be good to clarify the structure of the various pieces of the 
test infrastructure:
* data generation + real/pre-generated datasets
** Ideally we use generated data to avoid adding big files.  The generated 
datasets can still be run through external tools, and those results can be 
hard-coded into tests to make sure MLlib gets similar results.
* testing utilities such as approximate equality for scalars, vectors, 
matrices, etc.
* generic tests which can be used for subsets of algorithms (such as the 
testRegressor method you wrote)
** It will be especially nice if this infrastructure can be used to avoid 
loading the same dataset multiple times.
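
To make the three pieces above concrete, here is a rough, self-contained 
sketch in plain Python (no Spark dependency).  Every name in it 
(approx_equal, linear_dataset, fit_least_squares, check_regressor) is 
hypothetical and chosen only for illustration; in particular, 
check_regressor is not the testRegressor method from the patch, and the 
closed-form least-squares fit is just a stand-in for a real MLlib 
algorithm.  It assumes a seeded generator for the data, a cache so the 
dataset is built once, and a generic check that accepts any fit function.

```python
import math
import random
from functools import lru_cache


def approx_equal(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Approximate equality for scalars and (nested) lists/tuples of scalars."""
    a_seq = isinstance(a, (list, tuple))
    b_seq = isinstance(b, (list, tuple))
    if a_seq or b_seq:
        # Shapes must match before elements are compared.
        return (a_seq and b_seq and len(a) == len(b)
                and all(approx_equal(x, y, rel_tol, abs_tol)
                        for x, y in zip(a, b)))
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


@lru_cache(maxsize=None)
def linear_dataset(n=200, slope=2.0, intercept=-1.0, noise=0.01, seed=42):
    """Generate (x, y) pairs with y = slope * x + intercept + Gaussian noise.

    The seed makes the data reproducible; the cache means tests that ask
    for the same parameters share one dataset instead of rebuilding it.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return tuple(points)


def fit_least_squares(data):
    """Stand-in regressor: closed-form simple linear regression.

    Returns a predict(x) function, which is the only interface the
    generic check below relies on.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def check_regressor(fit, data, max_rmse):
    """Generic test: fit any regressor on data and bound its training RMSE."""
    predict = fit(data)
    mse = sum((predict(x) - y) ** 2 for x, y in data) / len(data)
    rmse = math.sqrt(mse)
    assert rmse <= max_rmse, "RMSE %g exceeds bound %g" % (rmse, max_rmse)
    return rmse


if __name__ == "__main__":
    data = linear_dataset()
    # The same generic check could be reused for every regressor under test.
    print(check_regressor(fit_least_squares, data, max_rmse=0.05))
```

A second regressor would be tested by passing its own fit function to 
check_regressor with the same cached dataset, so only the algorithm 
changes between tests.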

Could you please clarify the plans here?  After the design is clear, we can add 
one example test like you wrote & then add the others piecemeal.  Thanks!

> Benchmarks and common tests for mllib algorithm
> -----------------------------------------------
>
>                 Key: SPARK-3878
>                 URL: https://issues.apache.org/jira/browse/SPARK-3878
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Egor Pakhomov
>            Assignee: Egor Pakhomov
>
> There is no common practice in MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test 
> cases applicable to another algorithm. There are no benchmarks for comparing 
> algorithms. After implementing a new algorithm, it's very hard to understand 
> how it should be tested. 
> Lack of serialization testing: MLlib algorithms don't contain tests which 
> check that a model works after serialization. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

