2010/1/13 Robin Anil <[email protected]>:
> I have fired up a small EC2 instance (single node for the moment) and have
> been dabbling with the latest XML dump of the Wikipedia articles.
>
> The wiki XML is around 25GB, which was split into 128MB chunks and stored
> on HDFS. The WikipediaToSequenceFile class runs an M/R job to convert
> articles (without redirects) into sequence file format. It took 6 hours
> over the entire Wikipedia and produced a Gzip block-compressed sequence
> file of 6GB. The bottleneck I found there was that the current
> XMLInputFormat, which is checked in under examples, reads byte by byte to
> search for the start tag and end tag.
>
> I am currently running the word count of the DictionaryVectorizer (over the
> gzip-compressed 6GB data), where I see that CPU cycles are spent on one
> thing only, i.e. TokenStream.next(Token) (StandardAnalyzer). This job is
> also estimated to take 6 hours on that small instance. Beyond that, multiple
> map/reduce jobs calculate the partial vectors. Each of those iterations
> will take 6 hours more.
>
> If anyone has some idea on how to speed up both these bottlenecks (other
> than running more instances :P), please give some insight.
>
> The job page is here:
> http://ec2-67-202-51-4.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201001132019_0004&refresh=30

Interesting. I plan to try similar processing using Amazon Elastic Cloud
(to spare myself the burden of installing and tuning a Hadoop cluster),
hence using S3FS instead of regular HDFS. Does anyone have metrics on such
a setup (S3FS vs. HDFS w.r.t. the number of nodes for a text tokenization
task)?

While we are at it, it seems that TokenStream.next(Token) is deprecated
and that it should be updated as indicated in the following patch. I have
no clue whether this has an impact on performance, but if it works I guess
the Mahout code should be upgraded to be future-proof.
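
On the XMLInputFormat side, here is a rough sketch of one thing that might
be worth trying: keep the byte-by-byte tag matching, but read through a
large in-memory buffer so that each read() is a cheap array access rather
than a call into the underlying stream. This is not the Mahout code. The
class name BufferedTagScanner, the helper names and the 64KB buffer size
are made up for illustration, and I have not measured it. It may well bring
nothing if the Hadoop input stream is already buffered.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Mahout or Hadoop.
public class BufferedTagScanner {

    /**
     * Consumes bytes from in until the sequence match has been read.
     * If sink is non-null, every consumed byte (including the match) is
     * copied into it. Returns false if the stream ends first.
     * The simple restart logic assumes the first byte of match (here '<')
     * does not occur again inside match, which holds for XML tags.
     */
    public static boolean readUntilMatch(InputStream in, byte[] match,
            ByteArrayOutputStream sink) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (sink != null) {
                sink.write(b);
            }
            if (b == match[matched]) {
                matched++;
                if (matched == match.length) {
                    return true;
                }
            } else {
                // Restart, keeping this byte if it begins a new match.
                matched = (b == match[0]) ? 1 : 0;
            }
        }
        return false;
    }

    /** Returns the next startTag...endTag block, or null if none is left. */
    public static String nextElement(InputStream in, String startTag,
            String endTag) throws IOException {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        if (!readUntilMatch(in, start, null)) {
            return null;
        }
        ByteArrayOutputStream element = new ByteArrayOutputStream();
        element.write(start, 0, start.length);
        if (!readUntilMatch(in, end, element)) {
            return null;
        }
        return new String(element.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<mediawiki><page>first</page><junk/>"
                + "<page>second</page></mediawiki>";
        // The BufferedInputStream wrapper is the only change that matters:
        // read() is then served from a 64KB in-memory buffer.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                64 * 1024);
        String element;
        while ((element = nextElement(in, "<page>", "</page>")) != null) {
            System.out.println(element);
        }
    }
}
```

Run on the toy input in main, this prints the two <page> blocks and skips
everything else. In a real RecordReader the wrapped stream would be the one
opened on the input split, and the split boundaries would still have to be
checked while scanning.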
--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
Index: examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
===================================================================
--- examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java (revision 899149)
+++ examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java (working copy)
@@ -34,8 +34,8 @@
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.GenericsUtil;
import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.mahout.analysis.WikipediaAnalyzer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -77,9 +77,9 @@
.replaceAll(""));
TokenStream stream = analyzer.tokenStream(catMatch, new StringReader(
document));
- Token token = new Token();
- while ((token = stream.next(token)) != null) {
- contents.append(token.termBuffer(), 0, token.termLength()).append(' ');
+ TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
+ while (stream.incrementToken()) {
+ contents.append(termAtt.termBuffer(), 0, termAtt.termLength()).append(' ');
}
output.collect(new Text(SPACE_NON_ALPHA_PATTERN.matcher(catMatch)
.replaceAll("_")), new Text(contents.toString()));