On Tue, Oct 12, 2010 at 7:28 PM, Paul Ingles <[email protected]> wrote:
> I found that we needed to 'borrow' Mahout's XmlInputFormat to get this to
> work correctly. I posted a small blog article on it a while back:
> http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>
> You could either add a dependency on the Mahout jars or copy the class
> source and compile it in your tree.
>
I have read your post, thanks. Could you please tell me how I can "add the
dependency on the mahout jars"? Is it by using the "-libjars" option on the
command line?

Thanks,
Bibek

> Hth,
> Paul
>
> Sent from my iPhone
>
> On 12 Oct 2010, at 18:10, Steve Lewis <[email protected]> wrote:
>
>> Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader
>> and org.apache.hadoop.mapreduce.lib.input.TextInputFormat.
>>
>> What you need to do is copy those and change the LineRecordReader to look
>> for the <page> tag.
>>
>> On Tue, Oct 12, 2010 at 5:02 AM, Bibek Paudel <[email protected]> wrote:
>>
>>> Hi,
>>> I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
>>> single-node cluster. I have already run MapReduce programs for
>>> wordcount and for building an inverted index.
>>>
>>> I am trying to run the wordcount program on a Wikipedia dump. It is a
>>> single XML file with Wikipedia pages' data in the following form:
>>>
>>> <page>
>>>   <title>Amr El Halwani</title>
>>>   <id>16000008</id>
>>>   <revision>
>>>     <id>368385014</id>
>>>     <timestamp>2010-06-16T13:32:28Z</timestamp>
>>>     <text xml:space="preserve">
>>>       Some multi-line text goes here.
>>>     </text>
>>>   </revision>
>>> </page>
>>>
>>> I want to do a wordcount of the text contained inside the <text> and
>>> </text> tags. Please let me know the correct way of doing this.
>>>
>>> What works:
>>> -----------
>>> $HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
>>>
>>> Straight out of the documentation, the following also works:
>>> ------------------------------------------------------------
>>> $HADOOP_HOME/bin/hadoop jar \
>>>     contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader \
>>>     "StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head \
>>>     -output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc
>>>
>>> What I am interested in doing is:
>>> ---------------------------------
>>> 1. use my Java classes in WordCount.jar (or something similar) as the
>>>    mapper and reducer (and driver).
>>> 2. if possible, pass configuration options, such as the begin and end
>>>    tags of the XML, from inside my Java program itself.
>>> 3. if possible, specify my intent to use StreamXmlRecordReader from
>>>    inside the Java program itself.
>>>
>>> Please let me know what I should read/do to solve these issues.
>>>
>>> Bibek
>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave Ne
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Institute for Systems Biology
>> Seattle WA
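[Editor's note on the `-libjars` question above: Hadoop's GenericOptionsParser accepts a `-libjars` flag (note the plural) that ships extra jars with a job, so one way to pull in the jar containing XmlInputFormat looks roughly like the command below. This is a sketch under assumptions: the jar path is illustrative, and `-libjars` is only honored when the driver parses its arguments through ToolRunner/GenericOptionsParser.]

```shell
# Sketch: ship the jar containing XmlInputFormat with the job via -libjars.
# The jar name/path below is illustrative -- point it at whichever jar
# actually holds the compiled XmlInputFormat class.
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount \
    -libjars /path/to/jar-with-xmlinputformat.jar \
    wikixml wikixml-op2
```

If the driver does not go through ToolRunner, the generic options are never parsed and `-libjars` is silently ignored; in that case the fallback is what Paul suggested — copy the class source into your own tree and compile it.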

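[Editor's note on questions 2 and 3: the record-splitting that StreamXmlRecordReader (and Mahout's XmlInputFormat) performs is conceptually simple — scan for a begin tag, buffer up to the matching end tag, and emit that span as one record. Below is a minimal plain-Java sketch of that logic; the class and method names are mine, and unlike a real InputFormat it works on an in-memory String and ignores split boundaries.]

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSketch {

    // Collect every begin..end span (tags included), in document order.
    static List<String> extractRecords(String input, String begin, String end) {
        List<String> records = new ArrayList<String>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(begin, pos);
            if (start < 0) break;                 // no more begin tags
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) break;                  // unterminated record: drop it
            records.add(input.substring(start, stop + end.length()));
            pos = stop + end.length();            // resume after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String dump = "<page><text xml:space=\"preserve\">alpha</text></page>"
                    + "<page><text xml:space=\"preserve\">beta</text></page>";
        // Matching on "<text" (no closing '>') also catches tags with attributes.
        for (String rec : extractRecords(dump, "<text", "</text>")) {
            System.out.println(rec);
        }
    }
}
```

If you take the Mahout XmlInputFormat route instead, my recollection is that it reads its tags from job configuration keys named `xmlinput.start` and `xmlinput.end` (worth verifying against the source), which you would set on the `Configuration` in your driver before calling `job.setInputFormatClass(XmlInputFormat.class)` — that covers setting the tags and choosing the reader from inside the Java program.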