On Tue, 2010-12-28 at 16:04 -0800, Taylor, Ronald C wrote:
> Hi Kevin,
>
> So - from what Patrick and Ted are saying, it sounds like we want the best
> way to parallelize a source-based push, rather than doing a parallelized
> pull through a MapReduce program. And I see that what you ask about below
> is on parallelizing a push, so we are on the same page.
>
> Ron
Hi Ron,

I think there are merits to both approaches, depending on the type of data.
If you only ever need a subset of the data for reprocessing, map/reducing
first and then transferring might be better. If you want all the raw data of
a particular set, a push transfer may make more sense. So we'd need solutions
to both:

* parallel map/reduce on the non-HDFS storage cluster, with a push to the
  HDFS reprocessing cluster
* parallel push of data from the non-HDFS storage cluster into the HDFS
  reprocessing cluster

Thanks,
Kevin

> -----Original Message-----
> From: Fox, Kevin M
> Sent: Tuesday, December 28, 2010 3:39 PM
> To: Patrick Angeles
> Cc: [email protected]; [email protected]; Taylor, Ronald C;
> Brown, David M JR
> Subject: Re: What is the fastest way to get a large amount of data into the
> Hadoop HDFS file system (or Hbase)?
>
> On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> > Ron,
> >
> > While MapReduce can help to parallelize the load effort, your likely
> > bottleneck is the source system (where the files come from). If the
> > files are coming from a single server, then parallelizing the load
> > won't gain you much past a certain point. You have to figure in how
> > fast you can read the file(s) off disk(s) and push the bits through
> > your network and finally onto HDFS.
> >
> > The best scenario is if you can parallelize the reads and have a fat
> > network pipe (10GbE or more) going into your Hadoop cluster.
>
> We have a way to parallelize a push from the archive storage cluster to
> the Hadoop storage cluster.
>
> Is there a way to target a particular storage node with a push into the
> Hadoop file system? The Hadoop cluster nodes are 1GigE-attached to the
> core switch, and we have a 10GigE uplink to the core from the storage
> archive.
> Say we have 4 nodes in each storage cluster (we have more; this is just a
> simplified example):
>
>   a0 --\                                  /-- h0
>   a1 --+                                  +-- h1
>   a2 --+  (A switch) -10gige- (h switch)  +-- h2
>   a3 --/                                  \-- h3
>
> I want to be able to have a0 talk to h0, and not have h0 decide the data
> belongs on h3, slowing down a3's ability to write data into h3 and greatly
> reducing bandwidth.
>
> Thanks,
> Kevin
>
> > Regards,
> >
> > - Patrick
> >
> > On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C
> > <[email protected]> wrote:
> >
> >     Folks,
> >
> >     We plan on uploading large amounts of data on a regular basis
> >     onto a Hadoop cluster, with Hbase operating on top of Hadoop.
> >     Figure eventually on the order of multiple terabytes per week.
> >     So - we are concerned about doing the uploads themselves as
> >     fast as possible from our native Linux file system into HDFS.
> >     Figure files will be in, roughly, the 1 to 300 GB range.
> >
> >     Off the top of my head, I'm thinking that doing this in
> >     parallel using a Java MapReduce program would work fastest. So
> >     my idea would be to have a file listing all the data files
> >     (full paths) to be uploaded, one per line, and then use that
> >     listing file as input to a MapReduce program.
> >
> >     Each Mapper would then upload one of the data files (using
> >     "hadoop fs -copyFromLocal <source> <dest>") in parallel with
> >     all the other Mappers, with the Mappers operating on all the
> >     nodes of the cluster, spreading out the file upload across the
> >     nodes.
> >
> >     Does that sound like a wise way to approach this? Are there
> >     better methods? Anything else out there for doing automated
> >     upload in parallel? We would very much appreciate advice in
> >     this area, since we believe upload speed might become a
> >     bottleneck.
> >
> >     - Ron Taylor
> >
> >     ___________________________________________
> >     Ronald Taylor, Ph.D.
> >     Computational Biology & Bioinformatics Group
> >     Pacific Northwest National Laboratory
> >     902 Battelle Boulevard
> >     P.O. Box 999, Mail Stop J4-33
> >     Richland, WA 99352 USA
> >     Office: 509-372-6568
> >     Email: [email protected]
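For what it's worth, the listing-file scheme Ron describes can be prototyped
without writing a MapReduce job at all, by driving the stock
"hadoop fs -copyFromLocal" CLI from a shell wrapper. A minimal sketch,
assuming a hypothetical listing (uploads.txt with made-up /data paths), a
hypothetical destination (/user/ingest), and a HADOOP variable that defaults
to "echo hadoop" so it dry-runs (prints the commands) without a live cluster:

```shell
#!/bin/sh
# Sketch of the listing-file idea: read full paths, one per line, and run
# "hadoop fs -copyFromLocal" for up to $PARALLEL files at once.
# ASSUMPTIONS: the paths, $DEST, and $PARALLEL are illustrative; $HADOOP
# defaults to "echo hadoop" so this prints commands instead of copying.
HADOOP="${HADOOP:-echo hadoop}"
PARALLEL=4
DEST=/user/ingest

# Hypothetical listing file -- in practice this names the real data files.
printf '%s\n' /data/run1.dat /data/run2.dat /data/run3.dat > uploads.txt

# xargs -P keeps up to $PARALLEL copies in flight, one source file per
# invocation ({} is replaced by each line of the listing).
xargs -P "$PARALLEL" -I{} $HADOOP fs -copyFromLocal {} "$DEST/" < uploads.txt
```

Each Mapper (or each archive node aN, in the push variant) would run this
over its own shard of the listing, e.g. split four ways with GNU
"split -n l/4 uploads.txt". Per Patrick's point, the parallelism that helps
is bounded by the source disks and the 10GigE uplink, not the client count.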
