[jira] Updated: (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results

Timothy Potter (JIRA) Fri, 25 Feb 2011 14:55:47 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Timothy Potter updated MAHOUT-588:
----------------------------------

    Attachment: ec2_setup_notes_v2.txt
                prep_asf_mail_archives.sh

Thanks for the instructions, Isabel. The only problem I see is that the current 
EC2 wiki is primarily based around creating your own Hadoop AMI, whereas my 
instructions are based on using an existing Hadoop 0.20.2 AMI from bixolabs (S3 
bucket: 453820947548/bixolabs-public-amis). I think our process is much easier, 
but the process currently on the wiki is still valid.

My updated notes are attached along with the setup script I used to create the 
SequenceFiles.

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, 
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, 
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, 
> TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, 
> TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, 
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java, 
> TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java, 
> clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, 
> ec2_setup_notes.txt, ec2_setup_notes_v2.txt, mahout-588_distribution.pdf, 
> prep_asf_mail_archives.sh, seq2sparse_small_failed.log, 
> seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's 
> clustering algorithms.  I've asked the two doing the project to do all the 
> work in the open here.  The goal is to use a publicly reusable dataset (for 
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and 
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done 
> as a Vectorizer) and the publication of the results will be put up on the 
> Wiki as well as in the book.  This issue is to track the patches, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
