question about processing XML file

Bibek Paudel Fri, 08 Oct 2010 04:48:57 -0700

Hi,
I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
single-node cluster. I have already run MapReduce programs for
word count and for building an inverted index.


I am trying to run the wordcount program on a Wikipedia dump. It is a
single XML file in which each line contains one Wikipedia page in the
following format:

<page>     <title>Main Page</title>    <text>Some text goes
here.</text>    </page>

I want to count the words in the text between the <text> and </text>
tags. Please let me know the correct way of doing this.
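
To make the goal concrete, here is a minimal sketch in plain Java of the
per-line work a mapper would have to do. It uses no Hadoop APIs. The
class name TextTagWordCount and the method countWords are hypothetical,
and it assumes the tags appear exactly as <text> and </text> with no
attributes, as in the sample above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: models the per-line work of a mapper.
class TextTagWordCount {
    // Non-greedy match of everything between <text> and </text>.
    // Assumption: the opening tag carries no attributes.
    private static final Pattern TEXT =
            Pattern.compile("<text>(.*?)</text>", Pattern.DOTALL);

    // Returns word -> count for the text inside every <text> element of one line.
    static Map<String, Integer> countWords(String pageLine) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TEXT.matcher(pageLine);
        while (m.find()) {
            // Split on runs of whitespace; punctuation stays attached to words.
            for (String word : m.group(1).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "<page>     <title>Main Page</title>    "
                + "<text>Some text goes here.</text>    </page>";
        // Prints {Some=1, goes=1, here.=1, text=1}
        System.out.println(countWords(line));
    }
}
```

The title "Main Page" is not counted because it lies outside the <text>
element. In a real job the same logic would sit inside the map method
and emit (word, 1) pairs instead of filling a local map.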

When I enter the following command, I get an error. The jar file, the
WordCount class and the input file all exist.

$HADOOP_HOME/bin/hadoop jar WordCount.jar -inputformat
"org.apache.hadoop.mapreduce.StreamInputFormat"
-Dstream.recordreader.class=org.apache.hadoop.streaming.StreamXmlRecordReader
 -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>"
WordCount wikixml wikixml-op2

Error:
-----------
Exception in thread "main" java.lang.ClassNotFoundException: -inputformat
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
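
The trace shows Class.forName being called from RunJar.main with the
string "-inputformat", which suggests that the first argument after the
jar name is being taken as the main class name. The following is a
simplified model of that behavior, not the actual RunJar source. The
class RunJarArgsSketch and the method mainClassName are hypothetical,
and it assumes WordCount.jar has no Main-Class entry in its manifest.

```java
// Hypothetical model of how the main class name appears to be chosen.
class RunJarArgsSketch {
    // If the manifest names no main class, fall back to the first
    // argument that follows the jar file on the command line.
    static String mainClassName(String manifestMainClass, String[] argsAfterJar) {
        if (manifestMainClass != null) {
            return manifestMainClass;
        }
        return argsAfterJar[0];
    }

    public static void main(String[] args) {
        // Arguments after "WordCount.jar" in the failing command (abbreviated).
        String[] failing = {"-inputformat", "WordCount", "wikixml", "wikixml-op2"};
        // Arguments after "WordCount.jar" in the command that used to work.
        String[] working = {"WordCount", "wikixml", "wikixml-op2"};

        System.out.println(mainClassName(null, working)); // prints WordCount
        try {
            Class.forName(mainClassName(null, failing));
        } catch (ClassNotFoundException e) {
            // The exception message is the name that could not be loaded.
            System.out.println("ClassNotFoundException: " + e.getMessage());
        }
    }
}
```

Under this model, any option placed between the jar name and the class
name is looked up as a class, which matches the exception above.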

What used to work:
----------------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2

Thanks for any help,
Bibek
