On Tue, Oct 12, 2010 at 7:28 PM, Paul Ingles <[email protected]> wrote:
> I found that we needed to 'borrow' Mahout's XmlInputFormat to get this to
> work correctly. I posted a small blog article on it a while back:
> http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>
> You could either add a dependency on the Mahout jars or copy the class
> source and compile it in your tree.
>
I have read your post, thanks. Could you please tell me how I can "add the
dependency on the mahout jars"? Is it by using the "-libjars" option on the
command line?

Thanks,
Bibek

> Hth,
> Paul
>
> Sent from my iPhone
>
> On 12 Oct 2010, at 18:10, Steve Lewis <[email protected]> wrote:
>
>> Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader
>> and org.apache.hadoop.mapreduce.lib.input.TextInputFormat.
>>
>> What you need to do is copy those and change the LineRecordReader to look
>> for the <page> tag.
>>
>> On Tue, Oct 12, 2010 at 5:02 AM, Bibek Paudel <[email protected]> wrote:
>>
>>> Hi,
>>> I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
>>> single-node cluster. I have already run MapReduce programs for
>>> wordcount and for building an inverted index.
>>>
>>> I am trying to run the wordcount program on a Wikipedia dump. It is a
>>> single XML file with Wikipedia pages' data in the following form:
>>>
>>> <page>
>>>   <title>Amr El Halwani</title>
>>>   <id>16000008</id>
>>>   <revision>
>>>     <id>368385014</id>
>>>     <timestamp>2010-06-16T13:32:28Z</timestamp>
>>>     <text xml:space="preserve">
>>>       Some multi-line text goes here.
>>>     </text>
>>>   </revision>
>>> </page>
>>>
>>> I want to do a wordcount of the text contained inside the <text> and
>>> </text> tags. Please let me know the correct way of doing this.
>>>
>>> What works:
>>> -----------
>>> $HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
>>>
>>> Straight out of the documentation, the following also works:
>>> ------------------------------------------------------------
>>> $HADOOP_HOME/bin/hadoop jar \
>>>     contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader \
>>>     "StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head \
>>>     -output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc
>>>
>>> What I am interested in doing is:
>>> ---------------------------------
>>> 1. use my Java classes in WordCount.jar (or something similar) as the
>>>    mapper and reducer (and driver).
>>> 2. if possible, pass configuration options, such as the begin and end
>>>    tags of the XML, from inside my Java program itself.
>>> 3. if possible, specify my intent to use StreamXmlRecordReader from
>>>    inside the Java program itself.
>>>
>>> Please let me know what I should read/do to solve these issues.
>>>
>>> Bibek
>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave Ne
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Institute for Systems Biology
>> Seattle WA
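[Editor's note on the `-libjars` question above: Hadoop's GenericOptionsParser accepts a `-libjars` flag (note the plural) that ships extra jars with a job, so one way to pull in the jar containing XmlInputFormat looks roughly like the command below. This is a sketch under assumptions: the jar path is illustrative, and `-libjars` is only honored when the driver parses its arguments through ToolRunner/GenericOptionsParser.]

```shell
# Sketch: ship the jar containing XmlInputFormat with the job via -libjars.
# The jar name/path below is illustrative -- point it at whichever jar
# actually holds the compiled XmlInputFormat class.
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount \
    -libjars /path/to/jar-with-xmlinputformat.jar \
    wikixml wikixml-op2
```

If the driver does not go through ToolRunner, the generic options are never parsed and `-libjars` is silently ignored; in that case the fallback is what Paul suggested — copy the class source into your own tree and compile it.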

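[Editor's note on questions 2 and 3: the record-splitting that StreamXmlRecordReader (and Mahout's XmlInputFormat) performs is conceptually simple — scan for a begin tag, buffer up to the matching end tag, and emit that span as one record. Below is a minimal plain-Java sketch of that logic; the class and method names are mine, and unlike a real InputFormat it works on an in-memory String and ignores split boundaries.]

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSketch {

    // Collect every begin..end span (tags included), in document order.
    static List<String> extractRecords(String input, String begin, String end) {
        List<String> records = new ArrayList<String>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(begin, pos);
            if (start < 0) break;                 // no more begin tags
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) break;                  // unterminated record: drop it
            records.add(input.substring(start, stop + end.length()));
            pos = stop + end.length();            // resume after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String dump = "<page><text xml:space=\"preserve\">alpha</text></page>"
                    + "<page><text xml:space=\"preserve\">beta</text></page>";
        // Matching on "<text" (no closing '>') also catches tags with attributes.
        for (String rec : extractRecords(dump, "<text", "</text>")) {
            System.out.println(rec);
        }
    }
}
```

If you take the Mahout XmlInputFormat route instead, my recollection is that it reads its tags from job configuration keys named `xmlinput.start` and `xmlinput.end` (worth verifying against the source), which you would set on the `Configuration` in your driver before calling `job.setInputFormatClass(XmlInputFormat.class)` — that covers setting the tags and choosing the reader from inside the Java program.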