My input is typical row-based data over which I run a large stack of 
aggregations/rollups.  After reading earlier posts on this thread, I modified 
my loader to split the input into 1M-row partitions (literally gunzip -cd 
input.gz | split...).  I then ran an experiment using 50M rows (i.e. 50 gz 
files loaded into HDFS) on an 8-node cluster.  Ted, from what you are saying I 
should be using at least 80 files given the cluster size, and I should modify 
the loader to be aware of the number of nodes and split accordingly.  Do you 
concur?
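   
  In case it's useful, here is a rough sketch of how I'm thinking of making 
the loader node-aware (untested; the node count is hard-coded here, and the 
10x figure is just Ted's rule of thumb applied to my 8-node case):

    #!/bin/sh
    # Untested sketch: aim for ~10 split files per task node so the map
    # phase has plenty of inputs to parallelize over.
    NUM_NODES=8                        # would come from cluster config
    TARGET_FILES=$((NUM_NODES * 10))   # Ted's >10 files per node rule of thumb

    # This reads the input twice (once to count rows, once to split),
    # which is fine for a prototype but wasteful for large inputs.
    TOTAL_ROWS=$(gunzip -cd input.gz | wc -l)
    ROWS_PER_FILE=$(( (TOTAL_ROWS + TARGET_FILES - 1) / TARGET_FILES ))

    mkdir -p parts
    gunzip -cd input.gz | split -l "$ROWS_PER_FILE" - parts/part-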
   
  Load time to HDFS may be the next challenge.  My HDFS configuration on 8 
nodes uses a replication factor of 3.  Sequentially copying my data to HDFS 
using -copyFromLocal took 23 minutes to move 266MB in individual files of 
5.7MB each.  Does anybody find this result surprising?  Note that this is on 
EC2, where there is no such thing as rack-level or switch-level locality.  
Should I expect dramatically better performance on real iron?  Once I get this 
prototyping/education under my belt, my plan is to deploy a 64-node grid of 
4-way machines with a terabyte of local storage on each node.
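   
  One thing I may try next is pushing the files up in parallel rather than one 
at a time.  Something along these lines (untested; assumes GNU xargs, that the 
split files live under parts/, and that /user/cg/input is the target HDFS 
directory):

    # Untested sketch: keep 8 copyFromLocal uploads in flight at once
    # instead of copying each file sequentially.
    ls parts/part-* | xargs -P 8 -I {} \
        hadoop dfs -copyFromLocal {} /user/cg/input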
   
  Thanks for the discussion...the Hadoop community is very helpful!
   
  C G 
    

Ted Dunning <[EMAIL PROTECTED]> wrote:
  
They will only be a non-issue if you have enough of them to get the parallelism 
you want.  If the number of gzip files is > 10 * the number of task nodes, you 
should be fine.


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G wrote:
> Thanks Ted and Jason for your comments. Ted, your comments about gzip not 
> being splittable were very timely...I'm watching my 8 node cluster saturate 
> one node (with one gz file) and was wondering why. Thanks for the "answer in 
> advance" :-).
>
> Ted Dunning wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.


       