You cannot determine the start of an XML document within a collection of XML documents (in the DFS file) if you start at some arbitrary point in the collection (unless some data-specific hints are used).


Regards,
Mridul

On Tuesday 22 November 2011 09:28 AM, Michael Segel wrote:

Just wanted to address this:
Basically, in my MapReduce program I am expecting a complete XML document as my
input. I have a CustomReader (for XML) in my MapReduce job configuration. My
main confusion is: if the NameNode distributes data to DataNodes, there is a
chance that one part of the XML goes to one DataNode and the other half goes to
another DataNode. If that is the case, will my custom XMLReader in the
MapReduce job be able to combine them (since MapReduce reads data locally only)?
Please help me with this.

If you cannot do anything in parallel here, make your input split size cover
the complete file size. Also configure the block size to cover the complete
file size. In this case, only one mapper and one reducer will be spawned for
the file, but you won't get any parallel-processing advantage.


You can do this in parallel.
You need to write a custom input format class. (Which is what you're already 
doing...)

Let's see if I can explain this correctly.
You have an XML record split across block A and block B.

Your MapReduce job will instantiate a task per block.
So in the mapper processing block A, you read and process the XML records... when
you get to the last record, which is only partially in block A, mapper A will continue
on into block B and finish reading that last record, then stop.
In the mapper for block B, the reader will skip data, not processing anything, until
it sees the start of a record. So you end up getting all of your XML records processed
(no duplication), and done in parallel.
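
Roughly, the reader ends up looking like the sketch below. This is only an illustration of the idea, assuming records are delimited by literal <record> ... </record> tags; the tag names and the class itself are made up for the example, not any particular library's API.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a split-boundary-aware XML record reader. The <record> tags are
// made-up delimiters; adjust for the real document structure.
public class XmlRecordReader extends RecordReader<LongWritable, Text> {
  private static final byte[] START_TAG = "<record>".getBytes();
  private static final byte[] END_TAG = "</record>".getBytes();

  private FSDataInputStream in;
  private long start, end;                          // byte range of this split
  private final DataOutputBuffer buffer = new DataOutputBuffer();
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    Configuration conf = context.getConfiguration();
    start = fileSplit.getStart();
    end = start + fileSplit.getLength();
    FileSystem fs = fileSplit.getPath().getFileSystem(conf);
    in = fs.open(fileSplit.getPath());
    in.seek(start);          // the stream reads across HDFS block boundaries transparently
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only *start* records inside this split; a record that starts before
    // 'end' is read to completion even if its end tag lies in the next block.
    if (in.getPos() < end && readUntilMatch(START_TAG, false)) {
      buffer.reset();
      buffer.write(START_TAG);
      if (readUntilMatch(END_TAG, true)) {
        key.set(in.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    }
    return false;            // no more records belonging to this split
  }

  // Scan byte by byte for 'match'; if 'keep' is set, copy what we read into 'buffer'.
  private boolean readUntilMatch(byte[] match, boolean keep) throws IOException {
    int i = 0;
    while (true) {
      int b = in.read();
      if (b == -1) return false;                    // end of file
      if (keep) buffer.write(b);
      if (b == match[i]) {
        if (++i >= match.length) return true;       // matched the whole tag
      } else {
        i = 0;
      }
      // While looking for a start tag, stop once we pass the split end:
      // the record starting there belongs to the next mapper.
      if (!keep && i == 0 && in.getPos() >= end) return false;
    }
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() throws IOException {
    return Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
  }
  @Override public void close() throws IOException { in.close(); }
}

You would pair this with a FileInputFormat subclass whose createRecordReader() returns it. The boundary handling is the same trick TextInputFormat uses for lines that straddle block boundaries, so nothing special has to happen at the HDFS level.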

Does that make sense?

-Mike


Date: Tue, 22 Nov 2011 03:08:20 +0000
From: [email protected]
Subject: RE: Regarding loading a big XML file to HDFS
To: [email protected]; [email protected]

Also, I am wondering how you are writing the MapReduce application here. Map and
reduce work with key/value pairs.
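
For example (purely illustrative; the record tags and the counting output are made up), a custom reader usually hands each complete XML record to the mapper as the value of the pair:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the custom reader delivers one whole XML record as the value.
public class XmlRecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable offset, Text xmlRecord, Context context)
      throws IOException, InterruptedException {
    // xmlRecord is one complete <record>...</record> element; parse it with any
    // XML parser here and emit whatever key/value pairs the job actually needs.
    context.write(new Text("records"), ONE);   // trivial placeholder: count records
  }
}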
________________________________________
From: Uma Maheswara Rao G
Sent: Tuesday, November 22, 2011 8:33 AM
To: [email protected]; [email protected]
Subject: RE: Regarding loading a big XML file to HDFS

______________________________________
From: hari708 [[email protected]]
Sent: Tuesday, November 22, 2011 6:50 AM
To: [email protected]
Subject: Regarding loading a big XML file to HDFS

Hi,
I have a big file consisting of XML data. The XML is not represented as a
single line in the file. If we stream this file using the ./hadoop dfs -put
command to a Hadoop directory, how does the distribution happen?

HDFS will divide the file into blocks based on the block size configured for the file.
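
For illustration only (paths and the 128 MB figure are made up; on Hadoop of this vintage the relevant property is dfs.block.size), the block size can be set on the client side when copying the file in:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy a local XML file into HDFS with an explicit block size.
public class PutWithBlockSize {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);   // block size used for new files
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("file:///data/big.xml"),
                         new Path("/user/hari/big.xml"));
  }
}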

Basically, in my MapReduce program I am expecting a complete XML document as my
input. I have a CustomReader (for XML) in my MapReduce job configuration. My
main confusion is: if the NameNode distributes data to DataNodes, there is a
chance that one part of the XML goes to one DataNode and the other half goes to
another DataNode. If that is the case, will my custom XMLReader in the
MapReduce job be able to combine them (since MapReduce reads data locally only)?
Please help me with this.

If you cannot do anything in parallel here, make your input split size cover
the complete file size. Also configure the block size to cover the complete
file size. In this case, only one mapper and one reducer will be spawned for
the file, but you won't get any parallel-processing advantage.
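
If you do go the single-mapper route, one way to get that behaviour without juggling split and block sizes is to make the input format refuse to split the file. A sketch (class names are illustrative; XmlRecordReader is a custom XML reader like the one sketched further up in the thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: force one split (hence one mapper) per file by disabling splitting.
public class WholeFileXmlInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // the whole file becomes a single split, read by one mapper
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new XmlRecordReader();
  }
}

That single mapper will still read the blocks that live on other DataNodes (so, as noted above, you lose data locality and parallelism), but it sees the file as one continuous stream.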

--
View this message in context: 
http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

                                        
