Hi Christoph,

If I get all 60 GB onto HDFS, can I then split it into 60 files of 1 GB each and run a map-red job on those 60 fixed-length text files? If yes, do you have any idea how to do this?
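For example, would a map-only pass-through job along these lines be the right direction? Just a rough sketch, assuming the new org.apache.hadoop.mapreduce API; the class name and the input/output paths are only placeholders:

// PassThroughSplit.java - rough sketch, names are placeholders
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PassThroughSplit {

  // Emits each input line unchanged, dropping the byte-offset key so the
  // output files contain exactly the original text.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "split-60gb-into-1gb-parts");
    job.setJarByClass(PassThroughSplit.class);

    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                 // map-only: one output file per split
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Cap each split at ~1 GB so a 60 GB file yields ~60 map tasks,
    // and therefore ~60 part-m-XXXXX output files of roughly 1 GB each.
    FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the 60 GB file
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. the split output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the max split size capped at 1 GB, I would expect a 60 GB input to come out as roughly 60 part-m-* files of about 1 GB each, with no shuffle involved.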
On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <christoph.schm...@1und1.de> wrote:
> JJ,
>
> uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will
> be slow. If possible, try to get the files in smaller chunks where they are
> created, and upload them in parallel with a simple MapReduce job that only
> passes the data through (i.e. uses the standard Mapper and Reducer classes).
> This job should read from your local input directory and output into HDFS.
>
> If you cannot split the 60 GB where they are created, IMHO there is not
> much you can do. If you have a file format with, say, fixed-length records,
> you could try to create your own InputFormat that splits the file logically
> without creating the actual splits locally (which would be too costly, I
> assume).
>
> The performance of reading in parallel, though, will depend to a large
> extent on the nature of your local storage. If you have a single hard
> drive, reading in parallel might actually be slower than reading serially
> because it means a lot of random disk accesses.
>
> Regards,
> Christoph
>
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.le...@gmail.com]
> Sent: Monday, June 20, 2011 06:02
> To: mapreduce-user@hadoop.apache.org; cdh-u...@cloudera.org
> Subject: How to split a big file in HDFS by size
>
> Hi,
> I am trying to upload text files of size 60 GB or more.
> I want to split these files into smaller files of, say, 1 GB each so that I
> can run further map-red jobs on them.
>
> Does anybody have any idea how I can do this?
> Thanks a lot in advance! Any ideas are greatly appreciated!
>
> -JJ
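P.S. On the fixed-length-record idea: is something like the sketch below roughly what you mean by splitting the file logically? The class name and the 100-byte record length are made up for illustration (as far as I know no such InputFormat ships with Hadoop), so please treat it as a starting point only:

// FixedLengthInputFormat.java - illustrative sketch only; RECORD_LEN and
// the class names are assumptions.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  // Assumed record length in bytes; adjust to the real format.
  public static final int RECORD_LEN = 100;

  @Override
  protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // Round the usual split size down to a whole number of records so that
    // no record straddles two splits.
    long size = super.computeSplitSize(blockSize, minSize, maxSize);
    return Math.max(RECORD_LEN, (size / RECORD_LEN) * RECORD_LEN);
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FixedLengthRecordReader();
  }

  public static class FixedLengthRecordReader
      extends RecordReader<LongWritable, BytesWritable> {
    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(ctx.getConfiguration());
      in = fs.open(file);
      start = split.getStart();
      end = start + split.getLength();
      pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos + RECORD_LEN > end) return false;   // no complete record left
      byte[] buf = new byte[RECORD_LEN];
      in.readFully(pos, buf);                     // positioned read of one record
      key.set(pos);
      value.set(buf, 0, RECORD_LEN);
      pos += RECORD_LEN;
      return true;
    }

    @Override public LongWritable getCurrentKey()    { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() {
      return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { if (in != null) in.close(); }
  }
}

The idea would be that the 60 GB file stays in one piece on HDFS, and with the same setMaxInputSplitSize(job, 1 GB) call as above each map task just gets a roughly 1 GB slice rounded to whole records, so nothing has to be physically rewritten.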