[
https://issues.apache.org/jira/browse/MAHOUT-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036915#comment-13036915
]
Drew Farris commented on MAHOUT-694:
------------------------------------
Thanks for taking a look, Grant. I suspected it was as simple as a slash, but my
hadoop cluster wasn't behaving last night, so I couldn't complete the test.
Running it today, it works fine in local mode but not in distributed mode.
In this case, the sequence file in reuters-out-seqdir on hdfs is empty, which
would explain the IndexOutOfBoundsException below: RandomSeedGenerator has no
vectors to choose the initial clusters from. I suspect this is because
seqdirectory is attempting to find its input on hdfs instead of on the local
disk, though I haven't tested this assumption yet.
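For what it's worth, a quick way to check that assumption would be to look at
the seqdirectory output on hdfs (these are the relative paths from the log
below, resolved against the hdfs home directory of the user running the job):

    # does the seqdirectory output exist on hdfs, and how big is it?
    hadoop fs -ls examples/bin/work/reuters-out-seqdir
    hadoop fs -du examples/bin/work/reuters-out-seqdir

If the output file is only a few bytes, seqdirectory never saw the extracted
articles.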
If this were the case, one approach would be to copy the contents of
reuters-out up to hdfs when running in distributed mode prior to executing
seqdirectory.
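Something like this sketch (again using the relative paths from the log below,
which end up under the user's hdfs home directory):

    # push the extracted reuters-out directory up to hdfs before seqdirectory runs
    hadoop fs -mkdir examples/bin/work
    hadoop fs -put examples/bin/work/reuters-out examples/bin/work/reuters-out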
I suspect it is better to force seqdirectory to read from the local disk and write
to hdfs. Failing that, force seqdirectory to run in local mode and put the
result onto hdfs before executing seq2sparse.
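Roughly like this, assuming the bin/mahout driver falls back to local execution
when HADOOP_HOME is empty (the option names are the ones printed in the log
below; run from the mahout trunk directory):

    # run seqdirectory against the local filesystem ...
    HADOOP_HOME= ./bin/mahout seqdirectory \
      --input ./examples/bin/work/reuters-out/ \
      --output ./examples/bin/work/reuters-out-seqdir \
      --charset UTF-8 --chunkSize 5
    # ... then push the result to hdfs so seq2sparse can find it
    hadoop fs -put ./examples/bin/work/reuters-out-seqdir examples/bin/work/reuters-out-seqdir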
> IndexOutOfBoundException using build-reuters.sh
> -----------------------------------------------
>
> Key: MAHOUT-694
> URL: https://issues.apache.org/jira/browse/MAHOUT-694
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.5
> Environment: Linux Debian Lenny
> Hadoop 0.20 (Cloudera)
> Reporter: Allan BLANCHARD
> Assignee: Drew Farris
> Fix For: 0.5
>
> Attachments: MAHOUT-694.patch
>
>
> I run Hadoop-0.20 in distributed mode on 10 VMs (NameNode + JobTracker + 8
> DataNodes/TaskTrackers) with Mahout trunk.
> I tried to test the kmeans example with build-reuters.sh but I got an
> IndexOutOfBoundsException when it starts kmeans.
> I don't know which operation fails: ExtractReuters, seqdirectory,
> seq2sparse or kmeans. Maybe I missed some configuration? I searched on the web
> and didn't find a solution.
> ------------------------ UPDATE 05/16 -------------------------
> NameNode:/usr/local/mahout/trunk/examples/bin# ./build-reuters.sh
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. lda clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> ./build-reuters.sh: line 39: cd: examples/bin/: No such file or directory
> Downloading Reuters-21578
> % Total % Received % Xferd Average Speed Time Time Time
> Current
> Dload Upload Total Spent Left Speed
> 100 7959k 100 7959k 0 0 121k 0 0:01:05 0:01:05 --:--:-- 99k
> Extracting...
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:20 WARN driver.MahoutDriver: No
> org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath,
> will use command-line arguments only
> Deleting all files in ./examples/bin/work/reuters-out/-tmp
> 11/05/16 09:31:24 INFO driver.MahoutDriver: Program took 3471 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:26 INFO common.AbstractJob: Command line arguments:
> {--charset=UTF-8, --chunkSize=5, --endPhase=2147483647,
> --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
> --input=./examples/bin/work/reuters-out/, --keyPrefix=,
> --output=./examples/bin/work/reuters-out-seqdir, --startPhase=0,
> --tempDir=temp}
> 11/05/16 09:31:26 INFO driver.MahoutDriver: Program took 398 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
> n-gram size is: 1
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR
> value: 1.0
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
> reduce tasks: 1
> 11/05/16 09:31:29 INFO input.FileInputFormat: Total input paths to process : 1
> 11/05/16 09:31:29 INFO mapred.JobClient: Running job: job_201105160929_0001
> 11/05/16 09:31:30 INFO mapred.JobClient: map 0% reduce 0%
> 11/05/16 09:31:40 INFO mapred.JobClient: map 100% reduce 0%
> 11/05/16 09:31:42 INFO mapred.JobClient: Job complete: job_201105160929_0001
> [...]
> 11/05/16 09:33:58 INFO common.HadoopUtil: Deleting
> examples/bin/work/reuters-out-seqdir-sparse/partial-vectors-0
> 11/05/16 09:33:58 INFO driver.MahoutDriver: Program took 149846 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:34:00 INFO common.AbstractJob: Command line arguments:
> {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647,
> --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/,
> --maxIter=10, --method=mapreduce, --numClusters=20,
> --output=./examples/bin/work/reuters-kmeans, --overwrite=null,
> --startPhase=0, --tempDir=temp}
> 11/05/16 09:34:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/05/16 09:34:00 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 11/05/16 09:34:00 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> at java.util.ArrayList.get(ArrayList.java:322)
> at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> ------------------------------------------------------------------
> EDIT: I just tried this on Mahout 0.4 and it seems to work (with the same
> VM configuration).
> PS: Sorry for my very bad English :(
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira