On Mon, 2011-01-03 at 12:20 -0800, Buttler, David wrote:
> Hi Ron,
> Loading into HDFS and HBase are two different issues.  
> 
> HDFS: if you have a large number of files to load from your nfs file system 
> into HDFS it is not clear that parallelizing the load will help. 

It's not NFS. It's a parallel file system.

>  You have two sources of bottlenecks: the nfs file system and the HDFS file 
> system.  In your parallel example, you will likely saturate your nfs file 
> system first.

Unlikely in this case. We're in the unusual position of our archive
cluster being faster than our Hadoop cluster.

>   If they are actually local files, then loading them via M/R is a 
> non-starter as you have no control over which machine will get a map task.

If the same files are "local" on each node, does it matter? Shouldn't
the map tasks all be scheduled in such a way as to spread out the load?

Thanks,
Kevin

>   Unless all of the machines have files in the same directory and you are 
> just going to look in that directory to upload.  Then, it sounds like more of 
> a job for a parallel shell command and less of a map/reduce command.
> 
> HBase: So far my strategy has been to get the files into HDFS first, and then 
> write a Map job to load them into HBase.  You can try to do this and see if 
> direct inserts into hbase are fast enough for your use case.  But, if you are 
> going to TBs/week then you will likely want to investigate the bulk load 
> features.  I haven't yet incorporated that into my workflow so I can't offer 
> much advice there. Just be sure your cluster is sized appropriately.  E.g., 
> with your compression turned on in hbase, see how much a 1 GB input file 
> expands to inside hbase / hdfs.  That should give you a feeling for how much 
> space you will need for your expected data load.
> 
> Dave
> 
> 
> -----Original Message-----
> From: Taylor, Ronald C [mailto:[email protected]] 
> Sent: Tuesday, December 28, 2010 2:05 PM
> To: '[email protected]'; '[email protected]'
> Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
> Subject: What is the fastest way to get a large amount of data into the 
> Hadoop HDFS file system (or Hbase)?
> 
> 
> Folks,
> 
> We plan on uploading large amounts of data on a regular basis onto a Hadoop 
> cluster, with Hbase operating on top of Hadoop. Figure eventually on the 
> order of multiple terabytes per week. So - we are concerned about doing the 
> uploads themselves as fast as possible from our native Linux file system into 
> HDFS. Figure files will be in, roughly, the 1 to 300 GB range. 
> 
> Off the top of my head, I'm thinking that doing this in parallel using a Java 
> MapReduce program would work fastest. So my idea would be to have a file 
> listing all the data files (full paths) to be uploaded, one per line, and 
> then use that listing file as input to a MapReduce program. 
> 
> Each Mapper would then upload one of the data files (using "hadoop fs 
> -copyFromLocal <source> <dest>") in parallel with all the other Mappers, with 
> the Mappers operating on all the nodes of the cluster, spreading out the file 
> upload across the nodes.
> 
> Does that sound like a wise way to approach this? Are there better methods? 
> Anything else out there for doing automated upload in parallel? We would very 
> much appreciate advice in this area, since we believe upload speed might 
> become a bottleneck.
> 
>   - Ron Taylor
> 
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> 
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: [email protected]
> 
> 
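
For anyone wanting to try the listing-file idea from the original question
without writing a MapReduce job first, here is a minimal sketch in Python. It
reads a listing with one path per line and runs several
"hadoop fs -copyFromLocal <source> <dest>" commands concurrently from a single
node. Only that command comes from the thread. The function names, the
destination-directory layout, and the default worker count are assumptions
made for illustration, and it has not been benchmarked against any cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def read_listing(text):
    """Return the non-empty paths from a listing with one path per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_command(source, dest_dir):
    """Build the `hadoop fs -copyFromLocal <source> <dest>` argument list.

    Assumption: each file keeps its base name under dest_dir in HDFS.
    """
    dest = str(PurePosixPath(dest_dir) / PurePosixPath(source).name)
    return ["hadoop", "fs", "-copyFromLocal", source, dest]


def run_command(cmd):
    """Run one copy and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


def upload_all(sources, dest_dir, workers=4, runner=run_command):
    """Copy every source concurrently; return {source: exit_code}.

    `workers` is a hypothetical tuning knob; `runner` can be swapped out
    for a dry run that only records the commands.
    """
    cmds = [build_command(s, dest_dir) for s in sources]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, cmds))
    return dict(zip(sources, codes))
```

To spread the load across nodes, as discussed above, one option is to split
the listing into one slice per node and call upload_all on each node with its
own slice, for example through a parallel shell tool. Files whose exit code is
non-zero can be collected from the returned dictionary and retried.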

