Hi,
I am running Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode
on a single-node cluster. I have already run MapReduce programs for
word count and for building an inverted index.
I am now trying to run the word count program on a Wikipedia dump. It
is a single XML file where each line contains one Wikipedia page in the
following format:
<page> <title>Main Page</title> <text>Some text goes
here.</text> </page>
I want to count the words in the text contained between the tags <text>
and </text>. Could someone tell me the correct way to do this?
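To make the goal concrete, here is a minimal sketch in plain Java (no Hadoop classes) of the per-line logic I have in mind. The class name TextTagWordCount and both method names are placeholders I made up for illustration, and the sketch assumes the tags appear exactly as <text> and </text>, with no attributes, as in the sample line above.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: shows the per-line logic only.
public class TextTagWordCount {
    static final String BEGIN = "<text>";
    static final String END = "</text>";

    // Returns the content between the first <text> and the next </text>,
    // or an empty string if either tag is missing from the line.
    public static String extractText(String line) {
        int start = line.indexOf(BEGIN);
        if (start < 0) {
            return "";
        }
        start += BEGIN.length();
        int end = line.indexOf(END, start);
        if (end < 0) {
            return "";
        }
        return line.substring(start, end);
    }

    // Splits the extracted text on whitespace and counts each token.
    // Punctuation is not stripped, so "here." and "here" count separately.
    public static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(extractText(line));
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String page = "<page> <title>Main Page</title> "
                + "<text>Some text goes here.</text> </page>";
        System.out.println(countWords(page));
    }
}
```

In a real job the body of countWords would sit inside the map function and emit (word, 1) pairs instead of filling a local map; the title "Main Page" is ignored because it lies outside the <text> tags.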
When I run the following command, I get an error, even though the jar
file, the WordCount class, and the input file all exist.
$HADOOP_HOME/bin/hadoop jar WordCount.jar \
    -inputformat "org.apache.hadoop.mapreduce.StreamInputFormat" \
    -Dstream.recordreader.class=org.apache.hadoop.streaming.StreamXmlRecordReader \
    -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>" \
    WordCount wikixml wikixml-op2
Error:
-----------
Exception in thread "main" java.lang.ClassNotFoundException: -inputformat
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
What used to work:
----------------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
Thanks for any help,
Bibek