Thank you!

On 2/9/11, Timothy Potter (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992857#comment-12992857 ]
>
> Timothy Potter commented on MAHOUT-588:
> ---------------------------------------
>
> Here are the steps I take to vectorize using Amazon's Elastic MapReduce.
>
> 1. Install the elastic-mapreduce-ruby tool:
>
> On Debian-based Linux:
>
>   sudo apt-get install ruby1.8
>   sudo apt-get install libopenssl-ruby1.8
>   sudo apt-get install libruby1.8-extras
>
> Once these dependencies are installed, download and extract the
> elastic-mapreduce-ruby app:
>
>   mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
>   cd /mnt/dev/downloads
>   wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
>   cd /mnt/dev/elastic-mapreduce
>   unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip
>
>   # Create a file named credentials.json in /mnt/dev/elastic-mapreduce.
>   # See: http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1
>   # credentials.json should contain the following; note that the region
>   # is significant:
>
>   {
>     "access-id": "ACCESS_KEY",
>     "private-key": "SECRET_KEY",
>     "key-pair": "gsg-keypair",
>     "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem",
>     "region": "us-east-1",
>     "log-uri": "s3n://BUCKET/asf-mail-archives/logs/"
>   }
>
> It is also a good idea to add /mnt/dev/elastic-mapreduce to your PATH.
> 2. Once elastic-mapreduce is installed, start a cluster with no jobflow
> steps yet:
>
>   elastic-mapreduce --create --alive \
>     --log-uri s3n://BUCKET/asf-mail-archives/logs/ \
>     --key-pair gsg-keypair \
>     --slave-instance-type m1.xlarge \
>     --master-instance-type m1.xlarge \
>     --num-instances # \
>     --name mahout-0.4-vectorize \
>     --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
>
> This will create an EMR Job Flow named "mahout-0.4-vectorize" in the
> US-East region. Take note of the Job ID returned, as you will need it to
> add the "seq2sparse" step to the Job Flow.
>
> I'll leave it to you to decide how many instances to allocate, but keep
> in mind that one will be dedicated as the master. Also, it took about
> 75 minutes to run the seq2sparse job on 19 xlarge instances (~190
> normalized instance hours -- not cheap). I think you'll be safe to use
> about 10-13 instances and still finish in under 2 hours.
>
> Also, notice I'm using Amazon's bootstrap-action for configuring the
> cluster to run memory-intensive jobs. For more information about this,
> see:
> http://buyitnw.appspot.com/forums.aws.amazon.com/ann.jspa?annID=834
>
> 3. Mahout JAR
>
> The Mahout 0.4 Jobs JAR with our TamingAnalyzer is available at:
>
>   s3://thelabdude/mahout-examples-0.4-job-tt.jar
>
> If you need to change other Mahout code, then you'll need to post your
> own JAR to S3. Remember to reference the JAR using the s3n Hadoop
> protocol.
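The instance-hour arithmetic above can be sketched as follows before you pick a cluster size. It assumes m1.xlarge counts as 8 normalized hours per clock hour (the factor that makes the 19-node, 75-minute run work out to ~190; check Amazon's current normalization table), and note that EMR bills in whole hours, so actual charges round up:

```shell
#!/bin/sh
# Rough normalized-instance-hour estimate for the run described above.
INSTANCES=19    # cluster size from the example run above
FACTOR=8        # m1.xlarge normalization factor (assumption; verify in AWS docs)
MINUTES=75      # observed seq2sparse runtime from the run above

# 19 * 8 * 75/60 = 190, matching the "~190 normalized instance hours" figure.
awk -v n="$INSTANCES" -v f="$FACTOR" -v m="$MINUTES" \
  'BEGIN { printf "%d normalized instance hours\n", n * f * m / 60 }'
```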
> 4. Schedule a jobflow step to vectorize using Mahout's seq2sparse:
>
>   elastic-mapreduce --jar s3n://thelabdude/mahout-examples-0.4-job-tt.jar \
>     --main-class org.apache.mahout.driver.MahoutDriver \
>     --arg seq2sparse \
>     --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files/ \
>     --arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
>     --arg --weight --arg tfidf \
>     --arg --chunkSize --arg 100 \
>     --arg --minSupport --arg 400 \
>     --arg --minDF --arg 20 \
>     --arg --maxDFPercent --arg 80 \
>     --arg --norm --arg 2 \
>     --arg --numReducers --arg ## \
>     --arg --analyzerName --arg org.apache.mahout.text.TamingAnalyzer \
>     --arg --maxNGramSize --arg 2 \
>     --arg --minLLR --arg 50 \
>     --enable-debugging \
>     -j JOB_ID
>
> These settings are pretty aggressive in order to reduce the vectors to
> around 100,000 dimensions.
>
> IMPORTANT: Set the number of reducers to 2 x (N-1), where N is the size
> of your cluster.
>
> The job will send its output to HDFS instead of S3 (see MAHOUT-598).
> Once the job completes, we'll copy the results to S3 from our cluster's
> HDFS using distcp.
>
> NOTE: To monitor the status of the job, use:
>
>   elastic-mapreduce --logs -j JOB_ID
>
> 5. Save the log after completion
>
> Once the job completes, save the log output for further analysis:
>
>   elastic-mapreduce --logs -j JOB_ID > seq2sparse.log
>
> 6. SSH into the master node to run distcp:
>
>   elastic-mapreduce --ssh -j JOB_ID
>
>   hadoop fs -lsr /asf-mail-archives/mahout-0.4/vectors/
>   hadoop distcp /asf-mail-archives/mahout-0.4/vectors/ \
>     s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/ &
>
> Note: You will need all the output from the vectorize step in order to
> run Mahout's clusterdump.
>
> 7. Shut down your cluster
>
> Once you've copied the seq2sparse output to S3, you can shut down your
> cluster:
>
>   elastic-mapreduce --terminate -j JOB_ID
>
> Verify the cluster is terminated in your Amazon console.
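The reducer rule of thumb above (two reducers per slave, with one of the N instances reserved for the master) is easy to compute before scheduling the step; a small sketch:

```shell
#!/bin/sh
# Compute --numReducers as 2 x (N - 1), where N is the cluster size
# and one instance is dedicated to the master (rule of thumb above).
NUM_INSTANCES=19                               # example: the 19-node run above
NUM_REDUCERS=$(( 2 * (NUM_INSTANCES - 1) ))
echo "numReducers=$NUM_REDUCERS"               # 36 for a 19-node cluster
```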
> 8. Make the vectors public in S3 using the Amazon console or s3cmd:
>
>   s3cmd setacl --acl-public --recursive \
>     s3://BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/
>
> 9. Dump out the size of the vectors:
>
>   bin/mahout vectordump --seqFile \
>     s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/sparse-2-gram-stem/tfidf-vectors/part-r-00000 \
>     --sizeOnly | more
>
>
>> Benchmark Mahout's clustering performance on EC2 and publish the results
>> ------------------------------------------------------------------------
>>
>>                 Key: MAHOUT-588
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>>             Project: Mahout
>>          Issue Type: Task
>>            Reporter: Grant Ingersoll
>>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt,
>> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java,
>> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java,
>> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java,
>> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
>> TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
>> Uncompress.java, clusters1.txt, clusters_kMeans.txt,
>> distcp_large_to_s3_failed.log, ec2_setup_notes.txt,
>> seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>>
>>
>> For Taming Text, I've commissioned some benchmarking work on Mahout's
>> clustering algorithms. I've asked the two doing the project to do all
>> the work in the open here. The goal is to use a publicly reusable
>> dataset (for now, the ASF mail archives, assuming it is big enough),
>> run on EC2, and make all resources available so others can
>> reproduce/improve. I'd like to add the setup code to utils (although it
>> could possibly be done as a Vectorizer), and the publication of the
>> results will be put up on the Wiki as well as in the book. This issue
>> is to track the patches, etc.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
--
Lance Norskog
[email protected]
