Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader and org.apache.hadoop.mapreduce.lib.input.TextInputFormat.
What you need to do is copy those and change the LineRecordReader to look for the <page> tag.

On Tue, Oct 12, 2010 at 5:02 AM, Bibek Paudel <[email protected]> wrote:

> Hi,
> I use Hadoop 0.20.3-dev on Ubuntu. I use it in pseudo-distributed mode
> in a single node cluster. I have already run mapreduce programs for
> wordcount and building inverted index.
>
> I am trying to run the wordcount program in a wikipedia dump. It is a
> single XML file with Wikipedia pages' data in the following form:
>
> <page>
> <title>Amr El Halwani</title>
> <id>16000008</id>
> <revision>
> <id>368385014</id>
> <timestamp>2010-06-16T13:32:28Z</timestamp>
> <text xml:space="preserve">
> Some multi-line text goes here.
> </text>
> </page>
>
>
> I want to do wordcount of the text contained inside the tags <text>
> and </text>. Please let me know what is the correct way of doing this.
>
> What works:
> ------------------
> $HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
>
> Straight out of documentation, the following also works:
>
> ---------------------------------------------------------------------------------
> $HADOOP_HOME/bin/hadoop jar
> contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader
> "StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head
> -output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc
>
> What I am interested in doing is:
> -------------------------------------------------
> 1. use my java classes in WordCount.jar (or something similar) as
> mapper and reducer (and driver).
> 2. if possible, pass the configuration options, like begin and end
> tags of XML from inside my Java program itself.
> 3. if possible, specify my intent to use StreamXmlRecordReader from
> inside the java program itself.
>
> Please let me know what I should read/do to solve these issues.
>
> Bibek
> Bibek
>

--
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA
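P.S. To make the "look for the <page> tag" part concrete, here is a minimal sketch of the scanning logic that a modified copy of LineRecordReader might use in place of its line-splitting code. It is plain Java with no Hadoop dependency, so it can be tried on its own. The class name PageTagScanner and its methods are hypothetical names made up for this illustration, not Hadoop APIs, and the sketch assumes that records never nest and that it is fine to read one byte at a time. It also ignores input split boundaries, which a real record reader has to handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): pulls out every chunk of a
// stream that starts with startTag and ends with endTag.
final class PageTagScanner {
    private final InputStream in;
    private final byte[] startTag;
    private final byte[] endTag;

    PageTagScanner(InputStream in, String startTag, String endTag) {
        this.in = in;
        this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
        this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the next record, tags included, or null when no complete
    // record is left in the stream.
    String next() throws IOException {
        if (!readUntilMatch(startTag, null)) {
            return null; // no further start tag
        }
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.write(startTag, 0, startTag.length);
        if (!readUntilMatch(endTag, record)) {
            return null; // start tag without a matching end tag
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    // Consumes bytes until the whole of 'match' has been seen. Bytes are
    // copied into 'sink' when it is non-null. The restart rule below is
    // only correct for tags whose first byte does not occur again inside
    // the tag, which holds for "<page>" and "</page>".
    private boolean readUntilMatch(byte[] match, ByteArrayOutputStream sink)
            throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (sink != null) {
                sink.write(b);
            }
            if (b == match[matched]) {
                matched++;
                if (matched == match.length) {
                    return true;
                }
            } else {
                matched = (b == match[0]) ? 1 : 0;
            }
        }
        return false;
    }

    // Convenience wrapper: collects all records from the stream.
    static List<String> readAll(InputStream in, String startTag, String endTag) {
        PageTagScanner scanner = new PageTagScanner(in, startTag, endTag);
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = scanner.next()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A</title></page>\n"
                + "<page><title>B</title></page></mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String record : readAll(in, "<page>", "</page>")) {
            System.out.println(record);
        }
    }
}
```

In the copied record reader, each string returned by next() would become one map input value, and your copy of TextInputFormat would hand out that reader instead of LineRecordReader. The two tags do not have to be hard-coded: the driver can store them in the job Configuration under property names of your own choosing, and the reader can fetch them from there when it is initialized.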
