Re: DictionaryVectorizer meets Wikipedia.

Olivier Grisel Thu, 14 Jan 2010 02:40:26 -0800

2010/1/13 Robin Anil <[email protected]>:
> I have fired up a small EC2 instance (a single node for the moment) and have
> been dabbling with the latest XML dump of the Wikipedia article base.
>
> The wiki XML is around 25GB; it was split into 128MB chunks and stored on HDFS.
> The WikipediaToSequenceFile class runs an M/R job to convert articles (without
> redirects) into sequence file format. It took 6 hours over the entire Wikipedia
> and produced a gzip block-compressed sequence file of 6GB.
> The bottleneck I found there is that the current XMLInputFormat checked into
> examples reads byte by byte to search for the start tag and the end
> tag.
>
> I am currently running the word count of the DictionaryVectorizer (over the
> gzip-compressed 6GB data), where I see that CPU cycles are spent on one thing
> only, i.e. TokenStream.next(Token) (StandardAnalyzer). This job is also
> estimated to take 6 hours on that small instance. After that, multiple
> map/reduce jobs calculate the partial vectors. Each of those iterations
> will take 6 more hours.
>
> If anyone has ideas on how to speed up both of these bottlenecks (other than
> running more instances :P), please share some insight.
>
> The job page is here:
> http://ec2-67-202-51-4.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201001132019_0004&refresh=30


Interesting. I plan to try similar processing using Amazon
Elastic Cloud (to spare myself the burden of installing and tuning a
Hadoop cluster), hence using s3fs instead of regular HDFS. Does anyone
have metrics on such a setup (S3FS vs HDFS w.r.t. the number of nodes
for a text tokenization task)?

While we are at it, it seems that TokenStream.next(Token) is
deprecated and that it should be updated as indicated in the following
patch. I have no clue whether this has an impact on performance, but if
it works I guess the Mahout code should be upgraded to be future-proof.
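
On the other bottleneck from the quoted message (the byte-by-byte
search for the start and end tags), here is a minimal sketch of one
possible approach: pull the stream in large blocks and do the tag
matching against an in-memory buffer. BufferedTagScanner and skipPast
are hypothetical names, not part of Mahout or Hadoop, and I have not
measured whether this helps on the actual job, since the underlying
Hadoop stream may already buffer reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not Mahout code): finds tags using block reads. */
public class BufferedTagScanner {
  private final InputStream in;
  private final byte[] buf = new byte[64 * 1024]; // block size is an arbitrary choice
  private int pos = 0;       // next unread index in buf
  private int len = 0;       // number of valid bytes currently in buf
  private long consumed = 0; // total bytes consumed from the stream so far

  public BufferedTagScanner(InputStream in) {
    this.in = in;
  }

  /**
   * Consumes input up to and including the next occurrence of tag.
   * Returns the stream offset just after the tag, or -1 at end of stream.
   * Assumption: the first byte of tag does not occur again inside tag,
   * which holds for tags such as "<page>" and "</page>".
   */
  public long skipPast(byte[] tag) throws IOException {
    int matched = 0;
    while (true) {
      if (pos == len) {
        // Refill the buffer with one block read instead of one call per byte.
        len = in.read(buf, 0, buf.length);
        pos = 0;
        if (len <= 0) {
          len = 0;
          return -1;
        }
      }
      byte b = buf[pos++];
      consumed++;
      if (b == tag[matched]) {
        matched++;
        if (matched == tag.length) {
          return consumed;
        }
      } else {
        // Restart the match; the mismatching byte may itself begin the tag.
        matched = (b == tag[0]) ? 1 : 0;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] xml = "<mediawiki><page>text</page></mediawiki>"
        .getBytes(StandardCharsets.UTF_8);
    BufferedTagScanner scanner =
        new BufferedTagScanner(new ByteArrayInputStream(xml));
    long start = scanner.skipPast("<page>".getBytes(StandardCharsets.UTF_8));
    long end = scanner.skipPast("</page>".getBytes(StandardCharsets.UTF_8));
    System.out.println("page body spans offsets " + start + " to " + end);
  }
}
```

A real record reader would also have to copy the bytes between the two
tags into the output value and respect the split boundaries; the sketch
only covers the tag search itself.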

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Index: examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
===================================================================
--- examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java	(revision 899149)
+++ examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java	(working copy)
@@ -34,8 +34,8 @@
 import org.apache.hadoop.mapred.Reporter;
 import org.apache.hadoop.util.GenericsUtil;
 import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.mahout.analysis.WikipediaAnalyzer;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -77,9 +77,9 @@
           .replaceAll(""));
       TokenStream stream = analyzer.tokenStream(catMatch, new StringReader(
           document));
-      Token token = new Token();
-      while ((token = stream.next(token)) != null) {
-        contents.append(token.termBuffer(), 0, token.termLength()).append(' ');
+      TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
+      while (stream.incrementToken()) {
+        contents.append(termAtt.termBuffer(), 0, termAtt.termLength()).append(' ');
       }
       output.collect(new Text(SPACE_NON_ALPHA_PATTERN.matcher(catMatch)
           .replaceAll("_")), new Text(contents.toString()));
