Thank you!

On 2/9/11, Timothy Potter (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992857#comment-12992857 ]
>
> Timothy Potter commented on MAHOUT-588:
> ---------------------------------------
>
> Here are the steps I take to vectorize using Amazon's Elastic MapReduce.
>
> 1. Install the elastic-mapreduce-ruby tool:
>
> On Debian-based Linux:
>
>   sudo apt-get install ruby1.8
>   sudo apt-get install libopenssl-ruby1.8
>   sudo apt-get install libruby1.8-extras
>
> Once these dependencies are installed, download and extract the
> elastic-mapreduce-ruby app:
>
>   mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
>   cd /mnt/dev/downloads
>   wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
>   cd /mnt/dev/elastic-mapreduce
>   unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip
>
>   # Create a file named credentials.json in /mnt/dev/elastic-mapreduce.
>   # See: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
>   # credentials.json should contain the following; note that the region
>   # is significant:
>
>   {
>     "access-id": "ACCESS_KEY",
>     "private-key": "SECRET_KEY",
>     "key-pair": "gsg-keypair",
>     "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem",
>     "region": "us-east-1",
>     "log-uri": "s3n://BUCKET/asf-mail-archives/logs/"
>   }
>
> It is also a good idea to add /mnt/dev/elastic-mapreduce to your PATH.
> 2. Once elastic-mapreduce is installed, start a cluster with no jobflow
> steps yet:
>
>   elastic-mapreduce --create --alive \
>     --log-uri s3n://BUCKET/asf-mail-archives/logs/ \
>     --key-pair gsg-keypair \
>     --slave-instance-type m1.xlarge \
>     --master-instance-type m1.xlarge \
>     --num-instances # \
>     --name mahout-0.4-vectorize \
>     --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
>
> This will create an EMR Job Flow named "mahout-0.4-vectorize" in the
> US-East region. Take note of the Job ID returned, as you will need it to
> add the "seq2sparse" step to the Job Flow.
>
> I'll leave it to you to decide how many instances to allocate, but keep
> in mind that one will be dedicated as the master. Also, it took about
> 75 minutes to run the seq2sparse job on 19 xlarge instances (~190
> normalized instance hours -- not cheap). I think you'll be safe to use
> about 10-13 instances and still finish in under 2 hours.
>
> Also, notice I'm using Amazon's bootstrap-action for configuring the
> cluster to run memory-intensive jobs. For more information about this,
> see:
> http://buyitnw.appspot.com/forums.aws.amazon.com/ann.jspa?annID=834
>
> 3. Mahout JAR
>
> The Mahout 0.4 Jobs JAR with our TamingAnalyzer is available at:
>
>   s3://thelabdude/mahout-examples-0.4-job-tt.jar
>
> If you need to change other Mahout code, then you'll need to post your
> own JAR to S3. Remember to reference the JAR using the s3n Hadoop
> protocol.
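The instance-hour arithmetic above can be sketched as follows before you pick a cluster size. It assumes m1.xlarge counts as 8 normalized hours per clock hour (the factor that makes the 19-node, 75-minute run work out to ~190; check Amazon's current normalization table), and note that EMR bills in whole hours, so actual charges round up:

```shell
#!/bin/sh
# Rough normalized-instance-hour estimate for the run described above.
INSTANCES=19    # cluster size from the example run above
FACTOR=8        # m1.xlarge normalization factor (assumption; verify in AWS docs)
MINUTES=75      # observed seq2sparse runtime from the run above

# 19 * 8 * 75/60 = 190, matching the "~190 normalized instance hours" figure.
awk -v n="$INSTANCES" -v f="$FACTOR" -v m="$MINUTES" \
  'BEGIN { printf "%d normalized instance hours\n", n * f * m / 60 }'
```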
> 4. Schedule a jobflow step to vectorize using Mahout's seq2sparse:
>
>   elastic-mapreduce --jar s3n://thelabdude/mahout-examples-0.4-job-tt.jar \
>     --main-class org.apache.mahout.driver.MahoutDriver \
>     --arg seq2sparse \
>     --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
>     --arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
>     --arg --weight --arg tfidf \
>     --arg --chunkSize --arg 100 \
>     --arg --minSupport --arg 400 \
>     --arg --minDF --arg 20 \
>     --arg --maxDFPercent --arg 80 \
>     --arg --norm --arg 2 \
>     --arg --numReducers --arg ## \
>     --arg --analyzerName --arg org.apache.mahout.text.TamingAnalyzer \
>     --arg --maxNGramSize --arg 2 \
>     --arg --minLLR --arg 50 \
>     --enable-debugging \
>     -j JOB_ID
>
> These settings are pretty aggressive in order to reduce the vectors to
> around 100,000 dimensions.
>
> IMPORTANT: Set the number of reducers to 2 x (N-1), where N is the size
> of your cluster.
>
> The job will send its output to HDFS instead of S3 (see MAHOUT-598).
> Once the job completes, we'll copy the results to S3 from our cluster's
> HDFS using distcp.
>
> NOTE: To monitor the status of the job, use:
>
>   elastic-mapreduce --logs -j JOB_ID
>
> 5. Save the log after completion
>
> Once the job completes, save the log output for further analysis:
>
>   elastic-mapreduce --logs -j JOB_ID > seq2sparse.log
>
> 6. SSH into the master node to run distcp:
>
>   elastic-mapreduce --ssh -j JOB_ID
>
>   hadoop fs -lsr /asf-mail-archives/mahout-0.4/vectors/
>   hadoop distcp /asf-mail-archives/mahout-0.4/vectors/ \
>     s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/ &
>
> Note: You will need all the output from the vectorize step in order to
> run Mahout's clusterdump.
>
> 7. Shut down your cluster
>
> Once you've copied the seq2sparse output to S3, you can shut down your
> cluster:
>
>   elastic-mapreduce --terminate -j JOB_ID
>
> Verify the cluster is terminated in your Amazon console.
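The reducer rule of thumb above (two reducers per slave, with one of the N instances reserved for the master) is easy to compute before scheduling the step; a small sketch:

```shell
#!/bin/sh
# Compute --numReducers as 2 x (N - 1), where N is the cluster size
# and one instance is dedicated to the master (rule of thumb above).
NUM_INSTANCES=19                               # example: the 19-node run above
NUM_REDUCERS=$(( 2 * (NUM_INSTANCES - 1) ))
echo "numReducers=$NUM_REDUCERS"               # 36 for a 19-node cluster
```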
> 8. Make the vectors public in S3 using the Amazon console or s3cmd:
>
>   s3cmd setacl --acl-public --recursive \
>     s3://BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/
>
> 9. Dump out the size of the vectors:
>
>   bin/mahout vectordump --seqFile \
>     s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/tfidf-vectors/part-r-00000 \
>     --sizeOnly | more
>
>
>> Benchmark Mahout's clustering performance on EC2 and publish the results
>> ------------------------------------------------------------------------
>>
>>                 Key: MAHOUT-588
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>>             Project: Mahout
>>          Issue Type: Task
>>            Reporter: Grant Ingersoll
>>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt,
>> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java,
>> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java,
>> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
>> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
>> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
>> Uncompress.java, clusters1.txt, clusters_kMeans.txt,
>> distcp_large_to_s3_failed.log, ec2_setup_notes.txt,
>> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>>
>>
>> For Taming Text, I've commissioned some benchmarking work on Mahout's
>> clustering algorithms. I've asked the two doing the project to do all
>> the work in the open here. The goal is to use a publicly reusable
>> dataset (for now, the ASF mail archives, assuming it is big enough),
>> run on EC2, and make all resources available so others can
>> reproduce/improve. I'd like to add the setup code to utils (although it
>> could possibly be done as a Vectorizer), and the publication of the
>> results will be put up on the Wiki as well as in the book. This issue
>> is to track the patches, etc.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
--
Lance Norskog
[email protected]
