Hi Christoph,
If I get all 60 GB onto HDFS, can I then split it into 60 files of 1 GB each
and run a map-red job on those 60 fixed-length text files? If yes, do you
have any idea how to do this?

On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <
christoph.schm...@1und1.de> wrote:

> JJ,
>
> uploading 60 GB single-threaded (e.g. hadoop fs -copyFromLocal) will be
> slow. If possible, try to get the files in smaller chunks where they are
> created, and upload them in parallel with a simple MapReduce job that only
> passes the data through (i.e. uses the standard Mapper and Reducer classes).
> This job should read from your local input directory and write its output
> into HDFS.
>
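A minimal sketch of such a pass-through job, using the new
org.apache.hadoop.mapreduce API; the class name and both paths below are
placeholders, and it is written as a map-only variant that drops the
byte-offset keys so the uploaded text stays identical to the input:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PassThroughUpload {

  // Emits every line unchanged; dropping the byte-offset key keeps the
  // output text identical to the input text.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "pass-through upload");
    job.setJarByClass(PassThroughUpload.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                       // map-only, no shuffle
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // file:// input only works if this path is visible to the task nodes
    FileInputFormat.addInputPath(job, new Path("file:///data/incoming"));
    FileOutputFormat.setOutputPath(job, new Path("/user/jj/uploaded"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Making the job map-only avoids the shuffle, so the data goes straight from
the mappers into HDFS; the file:// input path is only useful if it is
readable from every task node (e.g. an NFS mount), otherwise the upload has
to run on the machine that holds the files.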
> If you cannot split the 60 GB where they are created, IMHO there is not
> much you can do. If you have a file format with, say, fixed-length records,
> you could try to create your own InputFormat that splits the file logically
> without creating the actual splits locally (which would be too costly, I
> assume).
>
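A rough sketch of what such a logically splitting InputFormat for fixed-length
records might look like; the class names and the record-length configuration
key are made up for illustration, and it assumes the file length is an exact
multiple of the record length:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  // Record length in bytes, set by the job driver (made-up config key).
  public static final String RECORD_LENGTH = "fixedlength.record.length";

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return true;  // any block boundary can be aligned to a record boundary
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FixedLengthRecordReader();
  }

  public static class FixedLengthRecordReader
      extends RecordReader<LongWritable, BytesWritable> {
    private FSDataInputStream in;
    private long start, pos, end;
    private int recordLength;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      recordLength = context.getConfiguration().getInt(RECORD_LENGTH, -1);
      start = split.getStart();
      // Round the split start up to the next record boundary; the reader of
      // the previous split handles the record that straddles the boundary.
      if (start % recordLength != 0) {
        start += recordLength - (start % recordLength);
      }
      end = split.getStart() + split.getLength();
      pos = start;
      FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
      in = fs.open(split.getPath());
      in.seek(start);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos >= end) {
        return false;  // the rest belongs to the next split
      }
      byte[] record = new byte[recordLength];
      IOUtils.readFully(in, record, 0, recordLength);
      key.set(pos);
      value.set(record, 0, recordLength);
      pos += recordLength;
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() {
      return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }
    @Override public void close() throws IOException {
      if (in != null) { in.close(); }
    }
  }
}

In the driver you would then set the record size and plug the format in,
e.g. conf.setInt(FixedLengthInputFormat.RECORD_LENGTH, 100) and
job.setInputFormatClass(FixedLengthInputFormat.class).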
> The performance of reading in parallel, though, will depend to a large
> extent on the nature of your local storage. If you have a single hard drive,
> reading in parallel might actually be slower than reading serially because
> it means a lot of random disk accesses.
>
> Regards,
> Christoph
>
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.le...@gmail.com]
> Sent: Monday, June 20, 2011 06:02
> To: mapreduce-user@hadoop.apache.org; cdh-u...@cloudera.org
> Subject: How to split a big file in HDFS by size
>
> Hi,
> I am trying to upload text files that are 60 GB or larger.
> I want to split these files into smaller files of, say, 1 GB each so that I
> can run further map-red jobs on them.
>
> Does anybody have an idea how I can do this?
> Thanks a lot in advance! Any ideas are greatly appreciated!
>
> -JJ
>
