FYI, I'm getting a lot of these (and not moderating any more due to lack of
time)

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





---------- Forwarded message ----------
From: <dev-reject-1364573050.63309.haimnphidmmapikej...@mahout.apache.org>
Date: Fri, Mar 29, 2013 at 12:04 PM
Subject: MODERATE for [email protected]
To:
Cc: dev-allow-tc.1364573050.abpdchciinoejcdfjbch-noreply=
[email protected]



To approve:
   dev-accept-1364573050.63309.haimnphidmmapikej...@mahout.apache.org
To reject:
   dev-reject-1364573050.63309.haimnphidmmapikej...@mahout.apache.org
To give a reason to reject:
%%% Start comment
%%% End comment



---------- Forwarded message ----------
From: "Dan Filimon" <[email protected]>
To: "Sebastian Schelter" <[email protected]>, "Ted Dunning" <
[email protected]>
Cc: "Dan Filimon" <[email protected]>, "mahout" <
[email protected]>
Date: Fri, 29 Mar 2013 16:04:08 -0000
Subject: Re: Review Request: MAHOUT-1181: Adds StreamingKMeans MapReduce
classes
   This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10193/

On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:


core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
<https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line203>
(Diff revision 1), line 203, in configureOptionsForWorkers() (line 175):

      log.info("No measure class given, using EuclideanDistanceMeasure");

  Why not make Euclidean distance the default value of the distance
measure option?

 I forgot to do that myself because the option is in
DefaultOptionCreator. Fortunately, the default set there,
SquaredEuclideanDistanceMeasure, is a great default, probably better
than EuclideanDistanceMeasure. So, I just removed this chunk of code
entirely.


 On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:


core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
<https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line309>
(Diff revision 1), line 309, in configureOptionsForWorkers() (line 175):

      log.error("Measure class not found " + measureClass, e);

  The program should throw an exception and terminate if the distance
measure class cannot be found, right?

 Indeed. I removed the try/catch.


 On March 29th, 2013, 1:48 p.m., *Sebastian Schelter* wrote:


core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
<https://reviews.apache.org/r/10193/diff/1/?file=276345#file276345line315>
(Diff revision 1), line 315, in configureOptionsForWorkers() (line 175):

      log.error("Searcher class not found " + measureClass, e);

  The program should throw an exception and terminate if the searcher
class cannot be found, right?

 Yep, same as above.
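
A minimal sketch of the fail-fast behaviour agreed on above, for
illustration only. It is not the code from the patch: FailFastLoader and
loadOrFail are hypothetical names, and the real driver reads its class
names from the job options. The point is that a missing class is not
logged and ignored; ClassNotFoundException simply propagates to the
caller.

```java
// Hypothetical helper, not part of the patch under review.
class FailFastLoader {

  // Loads the named class and checks that it is a subtype of expectedType.
  // There is deliberately no try/catch: if the class is missing,
  // ClassNotFoundException propagates and the caller terminates.
  // A class of the wrong type fails with ClassCastException.
  static <T> Class<? extends T> loadOrFail(String className, Class<T> expectedType)
      throws ClassNotFoundException {
    return Class.forName(className).asSubclass(expectedType);
  }
}
```

A caller would declare "throws ClassNotFoundException" as well, which is
what the signature of configureOptionsForWorkers() quoted above already
does.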


- Dan

On March 29th, 2013, 4:03 p.m., Dan Filimon wrote:
  Review request for mahout, Ted Dunning and Sebastian Schelter.
By Dan Filimon.

*Updated March 29, 2013, 4:03 p.m.*
Description

This depends (loosely) on https://reviews.apache.org/r/10194/

This patch implements the MapReduce version of StreamingKMeans for MAHOUT-1154.

It adds 5 new classes:
- CentroidWritable: class representing a centroid that can be written
to a SequenceFile
- StreamingKMeansDriver: class extending AbstractJob that is the
entry point to the MapReduce job
- StreamingKMeansMapper: mapper, running StreamingKMeans (see
MAHOUT-1162) and clustering the points one by one
- StreamingKMeansReducer: reducer, running BallKMeans (see
MAHOUT-1162) a number of times and picking the clustering with the
lowest total clustering cost.
  The cost is determined by randomly splitting the incoming centroids
  into a "training" and a "test" set, computing the centroids on the
  training set and the cost on the test set. The intent is to see
  whether the centroids actually describe the distribution of the
  points or not.
- StreamingKMeansUtilsMR: helper class with a method to instantiate a
searcher from a Configuration.
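
The train/test selection described for the reducer can be sketched
roughly as follows. This is an illustration under assumptions, not the
code from the patch: HoldoutClusteringSketch and all of its methods are
hypothetical, plain Lloyd iterations stand in for BallKMeans, points are
plain double[] arrays rather than Mahout centroids, and the sketch
re-splits the data on every run (the description does not say whether
the patch does).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the reducer from the patch.
class HoldoutClusteringSketch {

  // Squared Euclidean distance between two points of equal dimension.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Index of the center closest to p.
  static int nearest(double[] p, List<double[]> centers) {
    int best = 0;
    for (int i = 1; i < centers.size(); i++) {
      if (squaredDistance(p, centers.get(i)) < squaredDistance(p, centers.get(best))) {
        best = i;
      }
    }
    return best;
  }

  // Total cost: sum of squared distances from each point to its nearest center.
  static double cost(List<double[]> points, List<double[]> centers) {
    double total = 0;
    for (double[] p : points) {
      total += squaredDistance(p, centers.get(nearest(p, centers)));
    }
    return total;
  }

  // Stand-in for BallKMeans: Lloyd iterations from k randomly chosen seeds.
  // Assumes train.size() >= k.
  static List<double[]> lloyd(List<double[]> train, int k, int iterations, Random rng) {
    List<double[]> shuffled = new ArrayList<>(train);
    Collections.shuffle(shuffled, rng);
    List<double[]> centers = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      centers.add(shuffled.get(i).clone());
    }
    int dim = train.get(0).length;
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : train) {
        int c = nearest(p, centers);
        counts[c]++;
        for (int d = 0; d < dim; d++) sums[c][d] += p[d];
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue; // an empty cluster keeps its old center
        for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
        centers.set(c, sums[c]);
      }
    }
    return centers;
  }

  // Clusters numRuns times, each time fitting on a random training split and
  // scoring on the held-out test split; returns the centers with the lowest
  // test cost.
  static List<double[]> bestOfRuns(List<double[]> points, int k, int numRuns,
                                   double trainFraction, long seed) {
    Random rng = new Random(seed);
    List<double[]> best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int run = 0; run < numRuns; run++) {
      List<double[]> shuffled = new ArrayList<>(points);
      Collections.shuffle(shuffled, rng);
      int split = (int) (trainFraction * shuffled.size());
      List<double[]> train = shuffled.subList(0, split);
      List<double[]> test = shuffled.subList(split, shuffled.size());
      List<double[]> centers = lloyd(train, k, 10, rng);
      double testCost = cost(test, centers);
      if (testCost < bestCost) {
        bestCost = testCost;
        best = centers;
      }
    }
    return best;
  }
}
```

Scoring on held-out points rather than on the training points is what
lets the cost say something about whether the centers describe the
overall distribution, as opposed to only the points they were fitted to.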

Additionally, there is a test class StreamingKMeansTestMR that tests
the mapper, the reducer, and the two together using MRUnit.

!!!
MRUnit is now a dependency, so the core pom.xml file has to declare
it. We depend on the 1.0 snapshot, which is not yet released (it will
be very soon), hence the updated pom.xml is not provided for now.
!!!

Testing

See StreamingKMeansTestMR for the tests. These are all performed on
data sampled from a "hypercube" distribution (there is a multivariate
normal distribution at each vertex of the cube).
Additionally there are ongoing tests on the 20 newsgroups data set
(and some more are on the way).

Diffs

   - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java (PRE-CREATION)
   - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java (PRE-CREATION)
   - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java (PRE-CREATION)
   - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java (PRE-CREATION)
   - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java (PRE-CREATION)
   - core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java (PRE-CREATION)
   - src/conf/driver.classes.default.props (ac45eef)

View Diff <https://reviews.apache.org/r/10193/diff/>
