Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader
and org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

What you need to do is copy those two classes and change your copy of
LineRecordReader so that it looks for the <page> tag rather than line breaks.
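
To make the idea concrete, here is a rough, Hadoop-free sketch of the scanning
loop such a modified reader might use. The class and method names
(PageRecordScanner, next, splitPages) are made up for illustration and are not
part of Hadoop. In a real copy of LineRecordReader this logic would sit inside
nextKeyValue() and would also have to respect the split's start and end
offsets, which this sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pulls <page>...</page> blocks out of a byte stream.
class PageRecordScanner {
    private final InputStream in;
    private final byte[] startTag;
    private final byte[] endTag;

    PageRecordScanner(InputStream in, String startTag, String endTag) {
        this.in = in;
        this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
        this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the next complete record including both tags, or null when
    // the input ends before another complete record is found.
    String next() throws IOException {
        if (!readUntilMatch(startTag, null)) {
            return null;
        }
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.write(startTag);
        if (!readUntilMatch(endTag, record)) {
            return null; // unterminated record: drop it
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    // Consumes bytes until `tag` has been seen; copies them to `sink` if given.
    // The simple reset below assumes the tag's first byte ('<') does not occur
    // again inside the tag, which holds for <page>, </page>, <text>, </text>.
    private boolean readUntilMatch(byte[] tag, ByteArrayOutputStream sink)
            throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (sink != null) {
                sink.write(b);
            }
            if (b == tag[matched]) {
                matched++;
                if (matched == tag.length) {
                    return true;
                }
            } else {
                matched = (b == tag[0]) ? 1 : 0;
            }
        }
        return false;
    }

    // Convenience wrapper for trying the scanner on an in-memory string.
    static List<String> splitPages(String xml) {
        InputStream in =
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        PageRecordScanner scanner = new PageRecordScanner(in, "<page>", "</page>");
        List<String> pages = new ArrayList<>();
        try {
            String page;
            while ((page = scanner.next()) != null) {
                pages.add(page);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    public static void main(String[] args) {
        String sample = "<page><title>A</title></page><page><title>B</title></page>";
        for (String page : splitPages(sample)) {
            System.out.println(page);
        }
    }
}
```

Each string returned by next() would become one map input value; the mapper
can then pick out the part between <text> and </text> before counting words.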

On Tue, Oct 12, 2010 at 5:02 AM, Bibek Paudel <[email protected]> wrote:

> Hi,
> I use Hadoop 0.20.3-dev on Ubuntu. I use it in pseudo-distributed mode
> in a single node cluster. I have already run mapreduce programs for
> wordcount and building inverted index.
>
> I am trying to run the wordcount program on a Wikipedia dump. It is a
> single XML file with Wikipedia pages' data in the following form:
>
>  <page>
>    <title>Amr El Halwani</title>
>    <id>16000008</id>
>    <revision>
>      <id>368385014</id>
>      <timestamp>2010-06-16T13:32:28Z</timestamp>
>      <text xml:space="preserve">
>              Some multi-line text goes here.
>      </text>
>  </page>
>
>
> I want to do a wordcount of the text contained between the tags <text>
> and </text>. Please let me know the correct way to do this.
>
> What works:
> ------------------
> $HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
>
> Straight out of the documentation, the following also works:
>
> ---------------------------------------------------------------------------------
> $HADOOP_HOME/bin/hadoop jar
> contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader
> "StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head
> -output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc
>
> What I am interested in doing is:
> -------------------------------------------------
> 1. Use my Java classes in WordCount.jar (or something similar) as
> mapper and reducer (and driver).
> 2. If possible, pass the configuration options, like the begin and end
> tags of the XML, from inside my Java program itself.
> 3. If possible, specify my intent to use StreamXmlRecordReader from
> inside the Java program itself.
>
> Please let me know what I should read/do to solve these issues.
>
> Bibek
>



-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA
