Hi,
I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
single-node cluster. I have already run MapReduce programs for
wordcount and for building an inverted index.

I am trying to run the wordcount program on a Wikipedia dump. It is a
single XML file containing Wikipedia pages' data in the following form:

  <page>
    <title>Amr El Halwani</title>
    <id>16000008</id>
    <revision>
      <id>368385014</id>
      <timestamp>2010-06-16T13:32:28Z</timestamp>
      <text xml:space="preserve">
              Some multi-line text goes here.
      </text>
    </revision>
  </page>


I want to do a wordcount of the text contained between the <text>
and </text> tags. Please let me know the correct way of doing this.
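
For concreteness, the kind of mapper I have in mind would look roughly
like the sketch below (untested). It assumes each input record is one
whole <text>...</text> chunk handed to the mapper as the key, with an
empty value; XmlWordCountMapper is just a name I made up.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Counts words inside a single <text>...</text> record.
public class XmlWordCountMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(Text key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Drop the enclosing <text ...> and </text> tags, then tokenize
    // whatever is left and emit (word, 1) pairs.
    String record = key.toString().replaceAll("</?text[^>]*>", " ");
    StringTokenizer itr = new StringTokenizer(record);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);
    }
  }
}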

What works:
------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2

Straight out of the documentation, the following also works:
---------------------------------------------------------------------------------
$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>" \
  -input wiki_head -output wiki_head_op \
  -mapper /bin/cat -reducer /usr/bin/wc

What I am interested in doing is:
-------------------------------------------------
1. use my own Java classes in WordCount.jar (or something similar) as
the mapper and reducer (and driver).
2. if possible, pass configuration options, such as the begin and end
tags of the XML records, from inside my Java program itself.
3. if possible, specify my intent to use StreamXmlRecordReader from
inside the Java program itself (a rough sketch of what I have in mind
follows this list).
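
If I understand the streaming contrib correctly, the -inputreader
option just translates into a handful of stream.recordreader.*
configuration properties, so a plain Java driver might be able to set
them directly. A rough, untested sketch follows; XmlWordCount is my
own name, WordCountReducer stands in for my existing reducer class,
and I assume the streaming jar
(contrib/streaming/hadoop-0.20.2-streaming.jar) has to be on the job's
classpath so that StreamInputFormat and StreamXmlRecordReader can load.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlWordCount.class);
    conf.setJobName("wiki xml wordcount");

    // Use the streaming contrib's input format and tell it which
    // record reader to instantiate -- these are the properties I
    // believe -inputreader sets under the hood.
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<text>");
    conf.set("stream.recordreader.end", "</text>");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(XmlWordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class); // my existing reducer

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Does that look like the right direction, or is there a cleaner way to
do this from Java?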

Please let me know what I should read/do to solve these issues.

Bibek