What about the records at skipped boundaries? Or is there a way to define a custom splitter in Hadoop that understands record boundaries?
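Something along the lines of the (untested) sketch below is what I have in mind: a RecordReader that scans for a configurable start tag and is allowed to keep reading past the end of its own split to finish the last record, the way Mike describes below. The class name, the "xmlinput.start"/"xmlinput.end" config keys and the default <record> tags are only placeholders, not an existing Hadoop class.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class XmlRecordReader extends RecordReader<LongWritable, Text> {

  private byte[] startTag;
  private byte[] endTag;
  private long start;   // first byte of this split
  private long end;     // first byte past this split
  private FSDataInputStream in;
  private final DataOutputBuffer buffer = new DataOutputBuffer();
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    startTag = context.getConfiguration().get("xmlinput.start", "<record>").getBytes("UTF-8");
    endTag = context.getConfiguration().get("xmlinput.end", "</record>").getBytes("UTF-8");
    start = fileSplit.getStart();
    end = start + fileSplit.getLength();
    FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
    in = fs.open(fileSplit.getPath());
    in.seek(start);   // a reader for a later split just skips ahead until it sees a start tag
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Start a new record only if its start tag begins inside this split.
    // A record that merely *ends* past the boundary is still read to completion,
    // so nothing is lost and nothing is read twice.
    if (in.getPos() < end && readUntilMatch(startTag, false)) {
      buffer.reset();
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(in.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    }
    return false;
  }

  // Scan forward for 'match'; if 'record' is true, copy every byte read into 'buffer'.
  private boolean readUntilMatch(byte[] match, boolean record) throws IOException {
    int i = 0;
    while (true) {
      int b = in.read();
      if (b == -1) return false;                // end of file
      if (record) buffer.write(b);
      if (b == match[i]) {
        if (++i >= match.length) return true;   // whole tag matched
      } else {
        i = 0;
      }
      // While hunting for a start tag, give up once we are past the split end:
      // the record that begins there belongs to the next mapper.
      if (!record && i == 0 && in.getPos() >= end) return false;
    }
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public void close() throws IOException { in.close(); }

  @Override
  public float getProgress() throws IOException {
    return end == start ? 1.0f : Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
  }
}

The matching input format would just be a FileInputFormat<LongWritable, Text> subclass whose createRecordReader() returns this reader.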
- Inder

On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <[email protected]> wrote:

> Just wanted to address this:
>
> > Basically in my mapreduce program I am expecting a complete XML as my
> > input. I have a CustomReader (for XML) in my mapreduce job configuration.
> > My main confusion is: if the namenode distributes data to datanodes, there
> > is a chance that one part of the XML can go to one datanode and the other
> > half can go to another datanode. If that is the case, will my custom
> > XMLReader in the mapreduce be able to combine it (as mapreduce reads data
> > locally only)? Please help me on this.
>
> > If you cannot do anything in parallel here, make your input split size
> > cover the complete file size.
> > Also configure the block size to cover the complete file size. In this
> > case, only one mapper and reducer will be spawned for the file. But here
> > you won't get any parallel processing advantage.
>
> You can do this in parallel.
> You need to write a custom input format class. (Which is what you're
> already doing...)
>
> Let's see if I can explain this correctly.
> You have an XML record split across block A and block B.
>
> Your map reduce job will instantiate a task per block.
> So in the mapper processing block A, you read and process the XML records.
> When you get to the last record, which is only partly in A, mapper A will
> continue on into block B and finish reading that last record, then stop.
> In the mapper for block B, the reader will skip data without processing it
> until it sees the start of a record. So you end up getting all of your XML
> records processed (no duplication) and done in parallel.
>
> Does that make sense?
>
> -Mike
>
> > Date: Tue, 22 Nov 2011 03:08:20 +0000
> > From: [email protected]
> > Subject: RE: Regarding loading a big XML file to HDFS
> > To: [email protected]; [email protected]
> >
> > Also, I am wondering how you are writing a mapreduce application here.
> > Map and reduce work with key/value pairs.
> > ________________________________________
> > From: Uma Maheswara Rao G
> > Sent: Tuesday, November 22, 2011 8:33 AM
> > To: [email protected]; [email protected]
> > Subject: RE: Regarding loading a big XML file to HDFS
> >
> > > ______________________________________
> > > From: hari708 [[email protected]]
> > > Sent: Tuesday, November 22, 2011 6:50 AM
> > > To: [email protected]
> > > Subject: Regarding loading a big XML file to HDFS
> > >
> > > Hi,
> > > I have a big file consisting of XML data. The XML is not represented as
> > > a single line in the file. If we stream this file to a Hadoop directory
> > > using the ./hadoop dfs -put command, how does the distribution happen?
> >
> > HDFS will divide the file into blocks based on the block size configured
> > for the file.
> >
> > > Basically in my mapreduce program I am expecting a complete XML as my
> > > input. I have a CustomReader (for XML) in my mapreduce job
> > > configuration. My main confusion is: if the namenode distributes data to
> > > datanodes, there is a chance that one part of the XML can go to one
> > > datanode and the other half can go to another datanode. If that is the
> > > case, will my custom XMLReader in the mapreduce be able to combine it
> > > (as mapreduce reads data locally only)? Please help me on this.
> >
> > If you cannot do anything in parallel here, make your input split size
> > cover the complete file size.
> > Also configure the block size to cover the complete file size. In this
> > case, only one mapper and reducer will be spawned for the file. But here
> > you won't get any parallel processing advantage.
> >
> > > --
> > > View this message in context:
> > > http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
> > > Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
-- Inder
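P.S. For the single-mapper route Uma suggests above (split size and block size covering the whole file), I assume it boils down to marking the input as non-splittable, roughly like this (untested; the class name is just a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // hand the whole file to a single mapper, so no record is ever cut
  }
}

// In the driver, roughly:
//   Job job = new Job(conf, "xml load");
//   job.setInputFormatClass(WholeFileTextInputFormat.class);
//   // or, alternatively, force splits at least as large as the file:
//   FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);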
