What about the records at skipped boundaries? Or is there a way to define a custom splitter in Hadoop that understands record boundaries?
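Something along the lines of the (untested) sketch below is what I have in mind: a RecordReader that scans for a configurable start tag and is allowed to keep reading past the end of its own split to finish the last record, the way Mike describes below. The class name, the "xmlinput.start"/"xmlinput.end" config keys and the default <record> tags are only placeholders, not an existing Hadoop class.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class XmlRecordReader extends RecordReader<LongWritable, Text> {

  private byte[] startTag;
  private byte[] endTag;
  private long start;   // first byte of this split
  private long end;     // first byte past this split
  private FSDataInputStream in;
  private final DataOutputBuffer buffer = new DataOutputBuffer();
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    startTag = context.getConfiguration().get("xmlinput.start", "<record>").getBytes("UTF-8");
    endTag = context.getConfiguration().get("xmlinput.end", "</record>").getBytes("UTF-8");
    start = fileSplit.getStart();
    end = start + fileSplit.getLength();
    FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
    in = fs.open(fileSplit.getPath());
    in.seek(start);   // a reader for a later split just skips ahead until it sees a start tag
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Start a new record only if its start tag begins inside this split.
    // A record that merely *ends* past the boundary is still read to completion,
    // so nothing is lost and nothing is read twice.
    if (in.getPos() < end && readUntilMatch(startTag, false)) {
      buffer.reset();
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(in.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    }
    return false;
  }

  // Scan forward for 'match'; if 'record' is true, copy every byte read into 'buffer'.
  private boolean readUntilMatch(byte[] match, boolean record) throws IOException {
    int i = 0;
    while (true) {
      int b = in.read();
      if (b == -1) return false;                // end of file
      if (record) buffer.write(b);
      if (b == match[i]) {
        if (++i >= match.length) return true;   // whole tag matched
      } else {
        i = 0;
      }
      // While hunting for a start tag, give up once we are past the split end:
      // the record that begins there belongs to the next mapper.
      if (!record && i == 0 && in.getPos() >= end) return false;
    }
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public void close() throws IOException { in.close(); }

  @Override
  public float getProgress() throws IOException {
    return end == start ? 1.0f : Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
  }
}

The matching input format would just be a FileInputFormat<LongWritable, Text> subclass whose createRecordReader() returns this reader.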
- Inder

On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <[email protected]> wrote:

> Just wanted to address this:
>
> > Basically in my mapreduce program I am expecting a complete XML as my
> > input. I have a CustomReader (for XML) in my mapreduce job configuration.
> > My main confusion is: if the namenode distributes data to datanodes, there
> > is a chance that one part of the XML can go to one datanode and the other
> > half can go to another datanode. If that is the case, will my custom
> > XMLReader in the mapreduce be able to combine it (as mapreduce reads data
> > locally only)? Please help me on this.
>
> > If you cannot do anything in parallel here, make your input split size
> > cover the complete file size.
> > Also configure the block size to cover the complete file size. In this
> > case, only one mapper and reducer will be spawned for the file. But here
> > you won't get any parallel processing advantage.
>
> You can do this in parallel.
> You need to write a custom input format class. (Which is what you're
> already doing...)
>
> Let's see if I can explain this correctly.
> You have an XML record split across block A and block B.
>
> Your map reduce job will instantiate a task per block.
> So in the mapper processing block A, you read and process the XML records.
> When you get to the last record, which is only partly in A, mapper A will
> continue on into block B and finish reading that last record, then stop.
> In the mapper for block B, the reader will skip data without processing it
> until it sees the start of a record. So you end up getting all of your XML
> records processed (no duplication) and done in parallel.
>
> Does that make sense?
>
> -Mike
>
> > Date: Tue, 22 Nov 2011 03:08:20 +0000
> > From: [email protected]
> > Subject: RE: Regarding loading a big XML file to HDFS
> > To: [email protected]; [email protected]
> >
> > Also, I am wondering how you are writing a mapreduce application here.
> > Map and reduce work with key/value pairs.
> > ________________________________________
> > From: Uma Maheswara Rao G
> > Sent: Tuesday, November 22, 2011 8:33 AM
> > To: [email protected]; [email protected]
> > Subject: RE: Regarding loading a big XML file to HDFS
> >
> > > ______________________________________
> > > From: hari708 [[email protected]]
> > > Sent: Tuesday, November 22, 2011 6:50 AM
> > > To: [email protected]
> > > Subject: Regarding loading a big XML file to HDFS
> > >
> > > Hi,
> > > I have a big file consisting of XML data. The XML is not represented as
> > > a single line in the file. If we stream this file to a Hadoop directory
> > > using the ./hadoop dfs -put command, how does the distribution happen?
> >
> > HDFS will divide the file into blocks based on the block size configured
> > for the file.
> >
> > > Basically in my mapreduce program I am expecting a complete XML as my
> > > input. I have a CustomReader (for XML) in my mapreduce job
> > > configuration. My main confusion is: if the namenode distributes data to
> > > datanodes, there is a chance that one part of the XML can go to one
> > > datanode and the other half can go to another datanode. If that is the
> > > case, will my custom XMLReader in the mapreduce be able to combine it
> > > (as mapreduce reads data locally only)? Please help me on this.
> >
> > If you cannot do anything in parallel here, make your input split size
> > cover the complete file size.
> > Also configure the block size to cover the complete file size. In this
> > case, only one mapper and reducer will be spawned for the file. But here
> > you won't get any parallel processing advantage.
> >
> > > --
> > > View this message in context:
> > > http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
> > > Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
-- Inder
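P.S. For the single-mapper route Uma suggests above (split size and block size covering the whole file), I assume it boils down to marking the input as non-splittable, roughly like this (untested; the class name is just a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // hand the whole file to a single mapper, so no record is ever cut
  }
}

// In the driver, roughly:
//   Job job = new Job(conf, "xml load");
//   job.setInputFormatClass(WholeFileTextInputFormat.class);
//   // or, alternatively, force splits at least as large as the file:
//   FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);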
