Simple answer: don't. The Hadoop framework will take care of that for you and
split the file. The logical 60 GB file you see in HDFS actually *is* split
into smaller blocks (the default block size is 64 MB) and physically
distributed across the cluster.
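
For illustration, here is a rough sketch that uses the FileSystem API to list
the block layout of such a file (the path is made up); each entry is one block
of, by default, 64 MB together with the datanodes that hold its replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical location of the uploaded 60 GB file.
        Path file = new Path("/user/jj/bigfile.txt");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + java.util.Arrays.toString(block.getHosts()));
        }
    }
}

A MapReduce job over the whole 60 GB file gets roughly one map task per such
block, so you get the parallelism without splitting anything yourself.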

Regards,
Christoph

-----Original Message-----
From: Mapred Learn [mailto:mapred.le...@gmail.com]
Sent: Monday, June 20, 2011 08:36
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to split a big file in HDFS by size

Hi Christoph,
Once I have all 60 GB in HDFS, can I then split it into 60 files of 1 GB each
and run a map-red job on those 60 fixed-length text files? If yes, do you have
any idea how to do this?
 


 
On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz 
<christoph.schm...@1und1.de> wrote:


        JJ,
        
        uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.)
will be slow. If possible, try to get the files in smaller chunks where they
are created, and upload them in parallel with a simple MapReduce job that only
passes the data through (i.e. uses the standard Mapper and Reducer classes).
This job should read from your local input directory and write its output into
HDFS.
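
        A rough sketch of such a pass-through upload job is below (the paths
and the NFS assumption are mine, not something I know about your setup).
Instead of the stock identity classes I use a map-only job with a one-line
mapper that drops the byte-offset key, so the copied lines stay
byte-identical; the file:// input directory must be readable from every task
node, e.g. via an NFS mount.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelUpload {

    // Passes every line through unchanged; emitting NullWritable as the key
    // keeps TextOutputFormat from prepending the byte offset to each line.
    public static class CopyMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "parallel upload");
        job.setJarByClass(ParallelUpload.class);
        job.setMapperClass(CopyMapper.class);
        job.setNumReduceTasks(0);                  // map-only: just copy
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Local directory with the smaller chunks; it must be readable from
        // the task nodes (e.g. an NFS mount). Both paths are hypothetical.
        FileInputFormat.addInputPath(job, new Path("file:///data/incoming"));
        FileOutputFormat.setOutputPath(job, new Path("/user/jj/uploaded"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each input file yields at least one map task, so many smaller local chunks
give you many parallel writers into HDFS.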
        
        If you cannot split the 60 GB where they are created, IMHO there is not
much you can do. If you have a file format with, say, fixed-length records, you
could try to create your own InputFormat that splits the file logically without
creating the actual splits locally (which would be too costly, I assume).
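
        To make the idea concrete, here is a sketch of such an InputFormat for
fixed-length records (the class name and the 100-byte record length are made
up, and it assumes the file length is an exact multiple of the record length).
getSplits() only hands out byte ranges of roughly 1 GB; nothing is copied or
rewritten, and it works the same whether the input path is file:// or hdfs://:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Splits a file of fixed-length records logically: getSplits() only hands out
// byte ranges, the file itself is never copied or rewritten.
public class FixedRecordInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    public static final int RECORD_LENGTH = 100;   // hypothetical record size
    // Split size of roughly 1 GB, rounded down to whole records.
    private static final long SPLIT_SIZE =
            (1024L * 1024L * 1024L / RECORD_LENGTH) * RECORD_LENGTH;

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            long offset = 0;
            long remaining = file.getLen();
            while (remaining > 0) {
                long length = Math.min(SPLIT_SIZE, remaining);
                // Locality hints omitted for brevity.
                splits.add(new FileSplit(file.getPath(), offset, length,
                        new String[0]));
                offset += length;
                remaining -= length;
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, BytesWritable>() {
            private FSDataInputStream in;
            private long start, pos, end;
            private final LongWritable key = new LongWritable();
            private final BytesWritable value =
                    new BytesWritable(new byte[RECORD_LENGTH]);

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx)
                    throws IOException {
                FileSplit fileSplit = (FileSplit) s;
                start = fileSplit.getStart();
                pos = start;
                end = start + fileSplit.getLength();
                Path path = fileSplit.getPath();
                in = path.getFileSystem(ctx.getConfiguration()).open(path);
                in.seek(pos);   // jump to the start of this logical split
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (pos >= end) {
                    return false;
                }
                // Assumes the file length is a multiple of RECORD_LENGTH.
                in.readFully(value.getBytes(), 0, RECORD_LENGTH);
                key.set(pos);
                pos += RECORD_LENGTH;
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() {
                return end == start
                        ? 1.0f : (float) (pos - start) / (end - start);
            }
            @Override public void close() throws IOException {
                if (in != null) {
                    in.close();
                }
            }
        };
    }
}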
        
        The performance of reading in parallel, though, will depend to a large 
extent on the nature of your local storage. If you have a single hard drive, 
reading in parallel might actually be slower than reading serially because it 
means a lot of random disk accesses.
        
        Regards,
        Christoph
        
        -----Original Message-----
        From: Mapred Learn [mailto:mapred.le...@gmail.com]
        Sent: Monday, June 20, 2011 06:02
        To: mapreduce-user@hadoop.apache.org; cdh-u...@cloudera.org
        Subject: How to split a big file in HDFS by size
        

        Hi,
        I am trying to upload text files of 60 GB or more.
        I want to split these files into smaller files of, say, 1 GB each so
that I can run further map-red jobs on them.
        
        Does anybody have an idea how I can do this?
        Thanks a lot in advance! Any ideas are greatly appreciated!
        
        -JJ
        

