On Tue, 2010-12-28 at 16:04 -0800, Taylor, Ronald C wrote:
> Hi Kevin,
>
> So - from what Patrick and Ted are saying, it sounds like we want the best
> way to parallelize a source-based push, rather than doing a parallelized
> pull through a MapReduce program. And I see that what you ask about below
> is on parallelizing a push, so we are on the same page.
>
> Ron
Hi Ron,

I think there are merits to both approaches, depending on the type of data.
If you only ever need a subset of the data for reprocessing, map/reducing
first and then transferring might be better. If you want all the raw data of
a particular set, a push transfer may make more sense. So we'd need solutions
to both:

* parallel map/reduce on the non-HDFS storage cluster, with a push to the
  HDFS reprocessing cluster
* parallel push of data from the non-HDFS storage cluster into the HDFS
  reprocessing cluster

Thanks,
Kevin

> -----Original Message-----
> From: Fox, Kevin M
> Sent: Tuesday, December 28, 2010 3:39 PM
> To: Patrick Angeles
> Cc: [email protected]; [email protected]; Taylor, Ronald C;
> Brown, David M JR
> Subject: Re: What is the fastest way to get a large amount of data into the
> Hadoop HDFS file system (or Hbase)?
>
> On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> > Ron,
> >
> > While MapReduce can help to parallelize the load effort, your likely
> > bottleneck is the source system (where the files come from). If the
> > files are coming from a single server, then parallelizing the load
> > won't gain you much past a certain point. You have to figure in how
> > fast you can read the file(s) off disk(s) and push the bits through
> > your network and finally onto HDFS.
> >
> > The best scenario is if you can parallelize the reads and have a fat
> > network pipe (10GbE or more) going into your Hadoop cluster.
>
> We have a way to parallelize a push from the archive storage cluster to
> the Hadoop storage cluster.
>
> Is there a way to target a particular storage node with a push into the
> Hadoop file system? The Hadoop cluster nodes are 1GigE-attached to the
> core switch, and we have a 10GigE uplink to the core from the storage
> archive.
> Say we have 4 nodes in each storage cluster (we have more; this is just a
> simplified example):
>
>   a0 --\                                  /-- h0
>   a1 --+                                  +-- h1
>   a2 --+  (A switch) -10gige- (h switch)  +-- h2
>   a3 --/                                  \-- h3
>
> I want to be able to have a0 talk to h0, and not have h0 decide the data
> belongs on h3, slowing down a3's ability to write data into h3 and greatly
> reducing bandwidth.
>
> Thanks,
> Kevin
>
> > Regards,
> >
> > - Patrick
> >
> > On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C
> > <[email protected]> wrote:
> >
> >     Folks,
> >
> >     We plan on uploading large amounts of data on a regular basis
> >     onto a Hadoop cluster, with Hbase operating on top of Hadoop.
> >     Figure eventually on the order of multiple terabytes per week.
> >     So - we are concerned about doing the uploads themselves as
> >     fast as possible from our native Linux file system into HDFS.
> >     Figure files will be in, roughly, the 1 to 300 GB range.
> >
> >     Off the top of my head, I'm thinking that doing this in
> >     parallel using a Java MapReduce program would work fastest. So
> >     my idea would be to have a file listing all the data files
> >     (full paths) to be uploaded, one per line, and then use that
> >     listing file as input to a MapReduce program.
> >
> >     Each Mapper would then upload one of the data files (using
> >     "hadoop fs -copyFromLocal <source> <dest>") in parallel with
> >     all the other Mappers, with the Mappers operating on all the
> >     nodes of the cluster, spreading out the file upload across the
> >     nodes.
> >
> >     Does that sound like a wise way to approach this? Are there
> >     better methods? Anything else out there for doing automated
> >     upload in parallel? We would very much appreciate advice in
> >     this area, since we believe upload speed might become a
> >     bottleneck.
> >
> >     - Ron Taylor
> >
> >     ___________________________________________
> >     Ronald Taylor, Ph.D.
> >     Computational Biology & Bioinformatics Group
> >     Pacific Northwest National Laboratory
> >     902 Battelle Boulevard
> >     P.O. Box 999, Mail Stop J4-33
> >     Richland, WA 99352 USA
> >     Office: 509-372-6568
> >     Email: [email protected]
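For what it's worth, the listing-file scheme Ron describes can be prototyped
without writing a MapReduce job at all, by driving the stock
"hadoop fs -copyFromLocal" CLI from a shell wrapper. A minimal sketch,
assuming a hypothetical listing (uploads.txt with made-up /data paths), a
hypothetical destination (/user/ingest), and a HADOOP variable that defaults
to "echo hadoop" so it dry-runs (prints the commands) without a live cluster:

```shell
#!/bin/sh
# Sketch of the listing-file idea: read full paths, one per line, and run
# "hadoop fs -copyFromLocal" for up to $PARALLEL files at once.
# ASSUMPTIONS: the paths, $DEST, and $PARALLEL are illustrative; $HADOOP
# defaults to "echo hadoop" so this prints commands instead of copying.
HADOOP="${HADOOP:-echo hadoop}"
PARALLEL=4
DEST=/user/ingest

# Hypothetical listing file -- in practice this names the real data files.
printf '%s\n' /data/run1.dat /data/run2.dat /data/run3.dat > uploads.txt

# xargs -P keeps up to $PARALLEL copies in flight, one source file per
# invocation ({} is replaced by each line of the listing).
xargs -P "$PARALLEL" -I{} $HADOOP fs -copyFromLocal {} "$DEST/" < uploads.txt
```

Each Mapper (or each archive node aN, in the push variant) would run this
over its own shard of the listing, e.g. split four ways with GNU
"split -n l/4 uploads.txt". Per Patrick's point, the parallelism that helps
is bounded by the source disks and the 10GigE uplink, not the client count.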
