[
https://issues.apache.org/jira/browse/MAHOUT-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saikat Kanjilal resolved MAHOUT-1869.
-------------------------------------
Resolution: Won't Fix
After talking to Andrew Musselman, I am closing this issue and focusing on
bugs and documentation.
> Create a runtime performance measuring framework for mahout
> -----------------------------------------------------------
>
> Key: MAHOUT-1869
> URL: https://issues.apache.org/jira/browse/MAHOUT-1869
> Project: Mahout
> Issue Type: Story
> Components: build, Classification, Collaborative Filtering, Math
> Affects Versions: 1.0.0
> Reporter: Saikat Kanjilal
> Labels: build
> Fix For: 1.0.0
>
> Original Estimate: 1,008h
> Remaining Estimate: 1,008h
>
> This proposal outlines a runtime performance module used to measure the
> performance of various algorithms in Mahout in the three major areas:
> clustering, regression and classification. The module will be a
> Spray/Scala/Akka application that can be run against any current or new
> algorithm in Mahout and will produce a CSV file and a set of Zeppelin plots
> outlining the various performance criteria. The goal is that releasing any
> new Mahout build will include running a set of tests for each algorithm so
> that benchmarks can be compared and contrasted from one release to another.
> The GitHub repo is here: https://github.com/skanjila/mahout; I will send a
> pull request when I have one algorithm operational.
> Architecture
> The runtime performance application will run on top of Spray/Scala and Akka
> and will make async API calls into the various Mahout algorithms to generate
> a CSV file containing the runtime performance measurements for each
> algorithm of interest, as well as a set of Zeppelin plots for displaying
> some of these results. The Spray/Scala architecture will leverage the
> Zeppelin server to create the visualizations. The discussion below centers
> on two types of algorithms to be addressed by the application.
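> A minimal sketch of the async call path described above, assuming Akka
> actors and hypothetical message and class names (the Spray routing layer
> and the Zeppelin integration are omitted):
>
>   import akka.actor.{Actor, ActorSystem, Props}
>
>   case class RunRequest(algorithm: String, datasetPath: String)
>   case class RunCompleted(runId: String, csvPath: String)
>
>   class PerfRunActor extends Actor {
>     def receive = {
>       case RunRequest(algorithm, datasetPath) =>
>         val runId = java.util.UUID.randomUUID().toString
>         // ... invoke the Mahout algorithm on datasetPath here, time it,
>         // and write the per-run measurements to the CSV ...
>         val csvPath = s"/tmp/perf-$runId.csv"  // hypothetical output location
>         sender() ! RunCompleted(runId, csvPath)
>     }
>   }
>
>   object PerfApp extends App {
>     val system = ActorSystem("mahout-perf")
>     val runner = system.actorOf(Props[PerfRunActor], "runner")
>   }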
> Clustering
> The application will consist of a set of REST APIs to do the following:
> a) A method to load and execute the runtime perf module. It takes as inputs
> the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of
> files containing datasets of various sizes, and a set of values for the
> number of clusters to use for each of the dataset sizes, for example:
> /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> This API call will return a runId which the client program can then use to
> monitor the run (a sketch follows after item b below).
> b) A method to monitor the application to ensure that it is making progress
> towards generating the Zeppelin plots:
> /monitor/runId=456
> This method will execute asynchronously by calling into the Mahout kmeans
> (fuzzy kmeans) clustering implementations and will generate Zeppelin plots
> showing the normalized time on the y axis and the number of clusters on the
> x axis. The Spray/Scala/Akka framework will allow the client application to
> receive a callback when the runtime performance calculations are actually
> completed. For now the runtime performance measurements will consist of:
> a) the ratio of the number of points clustered correctly to the total number
> of points, and b) the total time taken for the algorithm to run. These items
> will be represented in separate Zeppelin plots.
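> As a concrete illustration of (a) and (b), here is a Scala sketch, with
> hypothetical names, of the request model, the runId handed back to the
> client, and the two measurements described above (the fraction of correctly
> clustered points and the total run time):
>
>   case class ClusteringRunRequest(
>     algorithm: String,     // "kmeans" or "fuzzy kmeans"
>     fileLocation: String,  // directory holding datasets of different sizes
>     clusters: Seq[Int])    // cluster counts to try, e.g. Seq(12, 20, 30, 40)
>
>   // Submitting a run only hands back an identifier; the work itself runs
>   // asynchronously behind the /monitor endpoint.
>   def submitRun(request: ClusteringRunRequest): String =
>     java.util.UUID.randomUUID().toString  // runId returned to the client
>
>   // Wall-clock timing of an algorithm run, in milliseconds.
>   def timed[A](block: => A): (A, Long) = {
>     val start = System.nanoTime()
>     val result = block
>     (result, (System.nanoTime() - start) / 1000000)
>   }
>
>   // Ratio of points clustered correctly to the total number of points.
>   def clusteringAccuracy(correctlyClustered: Long, totalPoints: Long): Double =
>     if (totalPoints == 0) 0.0 else correctlyClustered.toDouble / totalPoints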
> Regression
> a) The runtime performance module will run the likelihood ratio test with a
> different set of features in every run. We will introduce a REST API to run
> the likelihood ratio test and return the results; this will once again be an
> async call through the Spray/Akka stack.
> b) The runtime performance module will record the following metrics for
> every algorithm: 1) CPU usage, 2) memory usage, and 3) the time taken for
> the algorithm to converge and run to completion (see the sketch below).
> These metrics will be reported on top of the Zeppelin graphs for both the
> regression and the different clustering algorithms mentioned above.
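> Under the same assumptions, a sketch of the per-run metrics record from (b)
> and how it might be sampled and flattened into one CSV row per run (class
> and field names are illustrative):
>
>   case class RunMetrics(
>     algorithm: String,
>     cpuLoad: Double,        // system load average when the run finishes
>     memoryUsedBytes: Long,  // heap in use when the run finishes
>     elapsedMillis: Long)    // time to converge and run to completion
>
>   def sampleMetrics(algorithm: String, elapsedMillis: Long): RunMetrics = {
>     val os = java.lang.management.ManagementFactory.getOperatingSystemMXBean
>     val rt = Runtime.getRuntime
>     RunMetrics(algorithm, os.getSystemLoadAverage,
>       rt.totalMemory - rt.freeMemory, elapsedMillis)
>   }
>
>   // One CSV row per run, so the Zeppelin graphs can compare releases.
>   def toCsvRow(m: RunMetrics): String =
>     s"${m.algorithm},${m.cpuLoad},${m.memoryUsedBytes},${m.elapsedMillis}"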
> How does the application get run? The runtime performance measuring
> application will be invoked from the command line (a sketch follows below);
> eventually it would be worthwhile to hook it into some sort of integration
> test suite to certify the different Mahout releases.
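> One possible shape for that command-line entry point, with illustrative
> argument names that mirror the request parameters above:
>
>   object PerfCli extends App {
>     // e.g.: algorithm=kmeans fileLocation=/path/to/datasets clusters=12,20,30,40
>     val opts = args.map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap
>     val algorithm    = opts.getOrElse("algorithm", "kmeans")
>     val fileLocation = opts.getOrElse("fileLocation", ".")
>     val clusters     = opts.get("clusters")
>                            .map(_.split(",").map(_.trim.toInt).toSeq)
>                            .getOrElse(Seq(10))
>     println(s"Submitting $algorithm run over $fileLocation with cluster counts $clusters")
>   }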
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)