FYI, I'm getting a lot of these (and not moderating any more due to lack of time)
Otis
--
Solr & ElasticSearch Support
http://sematext.com/

---------- Forwarded message ----------
From: <dev-reject-1364573050.63309.haimnphidmmapikej...@mahout.apache.org>
Date: Fri, Mar 29, 2013 at 12:04 PM
Subject: MODERATE for [email protected]
To:
Cc: dev-allow-tc.1364573050.abpdchciinoejcdfjbch-noreply=[email protected]

To approve:
    dev-accept-1364573050.63309.haimnphidmmapikej...@mahout.apache.org
To reject:
    dev-reject-1364573050.63309.haimnphidmmapikej...@mahout.apache.org
To give a reason to reject:
%%% Start comment
%%% End comment

---------- Forwarded message ----------
From: "Dan Filimon" <[email protected]>
To: "Sebastian Schelter" <[email protected]>, "Ted Dunning" <[email protected]>
Cc: "Dan Filimon" <[email protected]>, "mahout" <[email protected]>
Date: Fri, 29 Mar 2013 16:04:08 -0000
Subject: Re: Review Request: MAHOUT-1181: Adds StreamingKMeans MapReduce classes

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10193/

On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:

    core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
    <https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line203> (Diff revision 1)

    [context, line 175] private void configureOptionsForWorkers() throws ClassNotFoundException, IllegalAccessException,
    203 log.info("No measure class given, using EuclideanDistanceMeasure");

    Why not make euclidean distance the default value of the distance measure option?

I forgot to do that myself because the option is in DefaultOptionCreator. Fortunately, the default set there, SquaredEuclideanDistance, is a great default, probably better than EuclideanDistance. So, I just removed this chunk of code entirely.

On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:

    core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
    <https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line309> (Diff revision 1)

    [context, line 175] private void configureOptionsForWorkers() throws ClassNotFoundException, IllegalAccessException,
    309 log.error("Measure class not found " + measureClass, e);

    program should throw an exception and terminate if the distance measure class cannot be found, right?

Indeed. I removed the try/catch.

On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:

    core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
    <https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line315> (Diff revision 1)

    [context, line 175] private void configureOptionsForWorkers() throws ClassNotFoundException, IllegalAccessException,
    315 log.error("Searcher class not found " + measureClass, e);

    program should throw an exception and terminate if the searcher class cannot be found, right?

Yep, same as above.

- Dan

On March 29th, 2013, 4:03 p.m., Dan Filimon wrote:

Review request for mahout, Ted Dunning and Sebastian Schelter.
By Dan Filimon.

*Updated March 29, 2013, 4:03 p.m.*

Description

This depends (loosely) on https://reviews.apache.org/r/10194/

This patch implements the MapReduce version of StreamingKMeans for MAHOUT-1154.

It adds 5 new classes:
- CentroidWritable: class representing a centroid that can be written to a SequenceFile
- StreamingKMeansDriver: class extending AbstractJob that is the entry point to the MapReduce job
- StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162) to cluster the points one by one
- StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a number of times and picking the clustering with the lowest total clustering cost.
  The cost is determined by randomly splitting the incoming centroids into a "training" and a "test" set, computing the centroids on the training set and the cost on the test set. The intent is to see whether the centroids actually describe the distribution of the points.
- StreamingKMeansUtilsMR: helper class with a method to instantiate a searcher from a Configuration.

Additionally, there is a test class, StreamingKMeansTestMR, that uses MRUnit to test the mapper, the reducer, and the two together.

!!! Since MRUnit is now a dependency, the core pom.xml has to declare it. We depend on the 1.0 snapshot, which is not yet released (it will be very soon), hence the updated pom.xml is not provided for now. !!!

Testing

See StreamingKMeansTestMR for the tests. These are all performed on data sampled from a "hypercube" distribution (there is a multinormal distribution at each vertex of the cube).
Additionally, there are ongoing tests on the 20 newsgroups data set (and some more are on the way).

Diffs

- core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java (PRE-CREATION)
- core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java (PRE-CREATION)
- core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java (PRE-CREATION)
- core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java (PRE-CREATION)
- core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java (PRE-CREATION)
- core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java (PRE-CREATION)
- src/conf/driver.classes.default.props (ac45eef)

View Diff <https://reviews.apache.org/r/10193/diff/>
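
For readers of the archive, the train/test cost check described for StreamingKMeansReducer can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class and method names (HoldoutCostSketch, pickCenters, bestOfRuns) are hypothetical, points are plain double[] arrays rather than Mahout centroids, the split is done once rather than per run, and picking random training points stands in for a real BallKMeans run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: none of these names come from the MAHOUT-1181 patch.
final class HoldoutCostSketch {

  /** Squared Euclidean distance between two points of equal dimension. */
  public static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Total cost: each test point contributes its squared distance to the nearest center. */
  public static double totalCost(List<double[]> centers, List<double[]> testPoints) {
    double cost = 0.0;
    for (double[] p : testPoints) {
      double nearest = Double.POSITIVE_INFINITY;
      for (double[] c : centers) {
        nearest = Math.min(nearest, squaredDistance(p, c));
      }
      cost += nearest;
    }
    return cost;
  }

  /** Stand-in for a BallKMeans run (assumption): picks up to k random training points as centers. */
  public static List<double[]> pickCenters(List<double[]> train, int k, Random rng) {
    List<double[]> copy = new ArrayList<>(train);
    Collections.shuffle(copy, rng);
    return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
  }

  /** Splits the points once, runs several trials on the training part, keeps the cheapest on the test part. */
  public static List<double[]> bestOfRuns(List<double[]> points, int k, int runs,
                                          double testFraction, long seed) {
    Random rng = new Random(seed);
    List<double[]> shuffled = new ArrayList<>(points);
    Collections.shuffle(shuffled, rng);
    int testSize = (int) (shuffled.size() * testFraction);
    List<double[]> test = shuffled.subList(0, testSize);
    List<double[]> train = shuffled.subList(testSize, shuffled.size());

    List<double[]> best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int run = 0; run < runs; run++) {
      List<double[]> centers = pickCenters(train, k, rng);
      double cost = totalCost(centers, test);
      // Keep the first candidate unconditionally, then any strictly cheaper one.
      if (best == null || cost < bestCost) {
        bestCost = cost;
        best = centers;
      }
    }
    return best;
  }
}
```

Scoring on held-out points rather than on the training points is what makes the comparison between runs meaningful: a run whose centers merely sit on top of the points they were computed from gets no credit unless those centers also lie close to points they never saw.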
