[
https://issues.apache.org/jira/browse/MAHOUT-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285690#comment-13285690
]
Jeff Eastman commented on MAHOUT-1021:
--------------------------------------
Well, here's what is happening with this job:
1. The canopy.Job driver run() method calls the InputDriver to convert the
input .csv file into a Vector sequence file. It happily processes the empty
file and produces a new empty sequence file.
2. The Job calls CanopyDriver.run passing the empty input file. This method
first calls CanopyDriver.buildClusters() to extract clusters from the input
file. The empty file is processed quickly and produces no clusters.
3. CanopyDriver.run then calls .clusterData(), which throws the exception since
there are no clusters.
The same steps would occur with the other algorithms used by this example, since
they all call .clusterData() and it fails if there are no clusters. I'm not
sure, however, exactly where to catch this particular user error.
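The failure in step 3 originates in ClusterMapper.setup(), which refuses to run when no canopies were produced by the previous step (see the stack trace in the quoted report below). Here is a minimal sketch of that guard; the class name ClusterMapperSketch and the simplified signature are illustrative only — the real method first loads the canopies from the prior job's output before making this check:

```java
import java.util.List;

// Simplified sketch of the check performed in
// org.apache.mahout.clustering.canopy.ClusterMapper.setup().
// The real method deserializes canopies from the previous job's
// output directory; here they are passed in directly.
class ClusterMapperSketch {
  static void setup(List<?> canopies) {
    if (canopies.isEmpty()) {
      // This is the message seen in the reported stack trace.
      throw new IllegalStateException("Canopies are empty!");
    }
  }
}
```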
- The current script downloads the .csv file, uploads it into Hadoop, then runs
the selected clustering job on it. But the script you called is no longer in
trunk, so the problem, if it was in the old script, no longer presents itself.
- The InputDriver could certainly check for an empty input directory, and this
would be a friendly gesture.
- All of the clustering drivers could check for empty input directories, and
this would be friendly too.
I think this is an excellent problem for a Mahout user who is interested in
contributing as a developer. It probably amounts to a single utility method (not
sure where to put it right now) that checks for empty input and stops the job
when it finds one. I suspect a number of Mahout algorithms would benefit, and it
would be well received.
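Such a utility might look like the following. This is only a sketch under assumptions: the class and method names (InputChecker, validateNonEmptyInput) are hypothetical, and it uses java.nio against the local filesystem for illustration, whereas a real Mahout implementation would go through Hadoop's FileSystem API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical utility: fail fast when a job's input is empty, instead of
// letting a downstream step die with "Canopies are empty!". A real
// implementation would use Hadoop's FileSystem/FileStatus API instead of
// java.nio.
public class InputChecker {

  public static void validateNonEmptyInput(Path input) throws IOException {
    if (!Files.exists(input)) {
      throw new IllegalStateException("Input path does not exist: " + input);
    }
    if (Files.isDirectory(input)) {
      // A directory with no files gives the mappers nothing to read.
      try (Stream<Path> entries = Files.list(input)) {
        if (!entries.findAny().isPresent()) {
          throw new IllegalStateException("Input directory is empty: " + input);
        }
      }
    } else if (Files.size(input) == 0) {
      // A zero-byte .csv is exactly the case reported in this issue.
      throw new IllegalStateException("Input file is empty: " + input);
    }
  }
}
```

Each driver's run() could call this on its input path before launching any MapReduce job, so the user error surfaces during the first job rather than the third.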
So, I'm going to mark this for 'backlog', which is where we put tasks of this
nature.
> Blank csv input file given to Canopy/Kmeans clustering
> ------------------------------------------------------
>
> Key: MAHOUT-1021
> URL: https://issues.apache.org/jira/browse/MAHOUT-1021
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: Mahout 0.6 version on hadoop 0.2, Testing on
> HadooponAzure platform
> Reporter: Nabarun Sengupta
> Labels: Bug, Clustering
>
> Hi,
> This is regarding a bug that we observed in Canopy clustering. We could
> reproduce the same in KMeans too. Given a blank csv input file, the algorithm
> executes two jobs and then throws an error during the third job's execution.
> When I tried to execute a malformed csv file with decimal or character
> values, I received an error during the first job itself. Therefore, I feel
> the same validation should be done when the input file is blank, with the
> exception thrown during the first job's execution.
> Following are the job execution details:
> c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
> "Please select a number to choose the corresponding clustering algorithm"
> "1. canopy clustering"
> "2. kmeans clustering"
> "3. fuzzykmeans clustering"
> "4. dirichlet clustering"
> "5. meanshift clustering"
> Enter your choice:1
> "ok. You chose 1 and we'll use canopy Clustering"
> "DFS is healthy... "
> "Uploading Synthetic control data to HDFS"
> Deleted hdfs://10.114.251.23:9000/user/milind/testdata
> "Successfully Uploaded Synthetic control data to HDFS "
> "Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
> c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar
> org.apache.mahout.driver.MahoutDriver
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job
> 05/17 10:46:11 WARN driver.MahoutDriver: No
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on
> classpath, will use command-line arguments only
> 05/17 10:46:11 INFO canopy.Job: Running with default arguments
> 05/17 10:46:12 INFO common.HadoopUtil: Deleting output
> 05/17 10:46:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing
> the arguments. Applications should implement Tool
> 05/17 10:46:13 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:46:14 INFO mapred.JobClient: Running job: job_201205170655_0017
> 05/17 10:46:15 INFO mapred.JobClient: map 0% reduce 0%
> 05/17 10:46:48 INFO mapred.JobClient: map 100% reduce 0%
> 05/17 10:46:59 INFO mapred.JobClient: Job complete: job_201205170655_0017
> 05/17 10:46:59 INFO mapred.JobClient: Counters: 15
> 05/17 10:46:59 INFO mapred.JobClient: Job Counters
> 05/17 10:46:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29672
> 05/17 10:46:59 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 05/17 10:46:59 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
> 05/17 10:46:59 INFO mapred.JobClient: Launched map tasks=1
> 05/17 10:46:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
> 05/17 10:46:59 INFO mapred.JobClient: File Output Format Counters
> 05/17 10:46:59 INFO mapred.JobClient: Bytes Written=90
> 05/17 10:46:59 INFO mapred.JobClient: FileSystemCounters
> 05/17 10:46:59 INFO mapred.JobClient: FILE_BYTES_READ=130
> 05/17 10:46:59 INFO mapred.JobClient: HDFS_BYTES_READ=134
> 05/17 10:46:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
> 05/17 10:46:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=90
> 05/17 10:46:59 INFO mapred.JobClient: File Input Format Counters
> 05/17 10:46:59 INFO mapred.JobClient: Bytes Read=0
> 05/17 10:46:59 INFO mapred.JobClient: Map-Reduce Framework
> 05/17 10:46:59 INFO mapred.JobClient: Map input records=0
> 05/17 10:46:59 INFO mapred.JobClient: Spilled Records=0
> 05/17 10:46:59 INFO mapred.JobClient: Map output records=0
> 05/17 10:46:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
> 05/17 10:46:59 INFO canopy.CanopyDriver: Build Clusters Input: output/data
> Out: output Measure:
> org.apache.mahout.common.distance.EuclideanDistanceMeasure@6eedf759 t1: 80.0
> t2: 55.0
> 05/17 10:46:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing
> the arguments. Applications should implement Tool
> 05/17 10:46:59 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:47:00 INFO mapred.JobClient: Running job: job_201205170655_0018
> 05/17 10:47:01 INFO mapred.JobClient: map 0% reduce 0%
> 05/17 10:47:33 INFO mapred.JobClient: map 100% reduce 0%
> 05/17 10:47:51 INFO mapred.JobClient: map 100% reduce 100%
> 05/17 10:48:02 INFO mapred.JobClient: Job complete: job_201205170655_0018
> 05/17 10:48:02 INFO mapred.JobClient: Counters: 25
> 05/17 10:48:02 INFO mapred.JobClient: Job Counters
> 05/17 10:48:02 INFO mapred.JobClient: Launched reduce tasks=1
> 05/17 10:48:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30327
> 05/17 10:48:02 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 05/17 10:48:02 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
> 05/17 10:48:02 INFO mapred.JobClient: Launched map tasks=1
> 05/17 10:48:02 INFO mapred.JobClient: Data-local map tasks=1
> 05/17 10:48:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16031
> 05/17 10:48:02 INFO mapred.JobClient: File Output Format Counters
> 05/17 10:48:02 INFO mapred.JobClient: Bytes Written=95
> 05/17 10:48:02 INFO mapred.JobClient: FileSystemCounters
> 05/17 10:48:02 INFO mapred.JobClient: FILE_BYTES_READ=396
> 05/17 10:48:02 INFO mapred.JobClient: HDFS_BYTES_READ=217
> 05/17 10:48:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45263
> 05/17 10:48:02 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=95
> 05/17 10:48:02 INFO mapred.JobClient: File Input Format Counters
> 05/17 10:48:02 INFO mapred.JobClient: Bytes Read=90
> 05/17 10:48:02 INFO mapred.JobClient: Map-Reduce Framework
> 05/17 10:48:02 INFO mapred.JobClient: Reduce input groups=0
> 05/17 10:48:02 INFO mapred.JobClient: Map output materialized bytes=6
> 05/17 10:48:02 INFO mapred.JobClient: Combine output records=0
> 05/17 10:48:02 INFO mapred.JobClient: Map input records=0
> 05/17 10:48:02 INFO mapred.JobClient: Reduce shuffle bytes=0
> 05/17 10:48:02 INFO mapred.JobClient: Reduce output records=0
> 05/17 10:48:02 INFO mapred.JobClient: Spilled Records=0
> 05/17 10:48:02 INFO mapred.JobClient: Map output bytes=0
> 05/17 10:48:02 INFO mapred.JobClient: Combine input records=0
> 05/17 10:48:02 INFO mapred.JobClient: Map output records=0
> 05/17 10:48:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
> 05/17 10:48:02 INFO mapred.JobClient: Reduce input records=0
> 05/17 10:48:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing
> the arguments. Applications should implement Tool
> 05/17 10:48:03 INFO input.FileInputFormat: Total input paths to process : 1
> 05/17 10:48:03 INFO mapred.JobClient: Running job: job_201205170655_0019
> 05/17 10:48:04 INFO mapred.JobClient: map 0% reduce 0%
> 05/17 10:48:35 INFO mapred.JobClient: Task Id :
> attempt_201205170655_0019_m_000000_0, Status : FAILED
> java.lang.IllegalStateException: Canopies are empty!
> at
> org.apache.mahout.clustering.canopy.ClusterMapper.setup(ClusterMapper.java:81)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> at org.apache.hadoop.mapred.Child.main(Child.java:260)
> attempt_201205170655_0019_m_000000_0: log4j:WARN No appenders could be found
> for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201205170655_0019_m_000000_0: log4j:WARN Please initialize the log4j
> system properly.
> 05/17 10:48:53 INFO mapred.JobClient: Task Id :
> attempt_201205170655_0019_m_000000_1, Status : FAILED
> java.lang.IllegalStateException: Canopies are empty!
> at
> org.apache.mahout.clustering.canopy.ClusterMapper.setup(ClusterMapper.java:81)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> at org.apache.hadoop.mapred.Child.main(Child.java:260)
> attempt_201205170655_0019_m_000000_1: log4j:WARN No appenders could be found
> for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201205170655_0019_m_000000_1: log4j:WARN Please initialize the log4j
> system properly.
> Terminate batch job (Y/N)? ^V
> Please let me know if this issue can be resolved.