On Fri, Aug 31, 2007 at 03:25:03PM -0700, Joydeep Sen Sarma wrote:
>Ah - very similar .. wish I had known :-)
>
>Support for Globs and logic to split/parallelize large text files would
>help ..
>

Please file these requests...

... and patches if you can! :)

Arun

>-----Original Message-----
>From: Stu Hood [mailto:[EMAIL PROTECTED] 
>Sent: Friday, August 31, 2007 2:23 PM
>To: hadoop-user@lucene.apache.org
>Subject: RE: Re: Compression using Hadoop...
>
>Isn't that what the distcp script does?
>
>Thanks,
>Stu
>
>
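For context on Stu's suggestion: distcp ships with Hadoop and runs the copy as a MapReduce job whose map tasks do the copying, so the work is spread across the cluster. A minimal invocation sketch, assuming the source directory is NFS-mounted at the same path on every task node (the paths and NameNode URI below are placeholders):

    # Copy an NFS-mounted directory into HDFS with the bundled distcp job.
    # The file:// source must be visible at this path from every task node.
    bin/hadoop distcp file:///mnt/nfs/incoming hdfs://namenode:9000/data/incoming
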
>-----Original Message-----
>From: Joydeep Sen Sarma 
>Sent: Friday, August 31, 2007 3:58pm
>To: hadoop-user@lucene.apache.org
>Subject: Re: Compression using Hadoop...
>
>One thing I had done to speed up copy/put speeds was to write a simple
>map-reduce job to do parallel copies of files from an input directory (in
>our case the input directory is NFS-mounted from all task nodes). It
>gives us a huge speed bump.
>
>It's trivial to roll your own - but I would be happy to share as well.
>
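The copy job Joydeep describes isn't included in the thread; as a simpler, client-side stopgap, several uploads can at least be overlapped from one machine with plain shell. A rough sketch, assuming GNU xargs and placeholder paths; unlike the MapReduce approach, it only uses one client's network link rather than every task node:

    # Run several -copyFromLocal uploads concurrently from a single client.
    # /mnt/nfs/incoming and /data/incoming are placeholders; -P 8 keeps
    # eight uploads in flight at a time (GNU xargs).
    cd /mnt/nfs/incoming
    ls *.gz | xargs -P 8 -I{} bin/hadoop dfs -copyFromLocal {} /data/incoming/
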
>
>-----Original Message-----
>From: C G [mailto:[EMAIL PROTECTED] 
>Sent: Friday, August 31, 2007 11:21 AM
>To: hadoop-user@lucene.apache.org
>Subject: RE: Compression using Hadoop...
>
>My input is typical row-based data over which I run a large stack
>of aggregations/rollups.  After reading earlier posts on this thread, I
>modified my loader to split the input into 1M-row partitions
>(literally gunzip -cd input.gz | split...).  I then ran an experiment
>using 50M rows (i.e. 50 gz files loaded into HDFS) on an 8-node cluster.
>Ted, from what you are saying, I should be using at least 80 files given
>the cluster size, and I should modify the loader to be aware of the
>number of nodes and split accordingly.  Do you concur?
>   
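Spelled out, the split-and-reload step C G describes might look roughly like this (the line count, file names, and HDFS path are placeholders). Splitting before recompressing keeps each piece independently decompressible, so each piece becomes its own map input:

    # Split one large gzip into 1M-line pieces, recompress each piece,
    # and load the pieces into HDFS so maps can run on them in parallel.
    gunzip -cd input.gz | split -l 1000000 - part-
    gzip part-*
    for f in part-*.gz; do
        bin/hadoop dfs -copyFromLocal "$f" /user/cg/input/
    done
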
>  Load time to HDFS may be the next challenge.  My HDFS configuration on
>8 nodes uses a replication factor of 3.  Sequentially copying my data to
>HDFS using -copyFromLocal took 23 minutes to move 266M in individual
>files of 5.7M each.  Does anybody find this result surprising?  Note
>that this is on EC2, where there is no such thing as rack-level or
>switch-level locality.  Should I expect dramatically better performance
>on real iron?  Once I get this prototyping/education under my belt, my
>plan is to deploy a 64-node grid of 4-way machines with a terabyte of
>local storage on each node.
>   
>  Thanks for the discussion...the Hadoop community is very helpful!
>   
>  C G 
>    
>
>Ted Dunning  wrote:
>  
>They will only be a non-issue if you have enough of them to get the
>parallelism you want.  If the number of gzip files is > 10 * the number
>of task nodes, you should be fine.
>
>
>-----Original Message-----
>From: [EMAIL PROTECTED] on behalf of jason gessner
>Sent: Fri 8/31/2007 9:38 AM
>To: hadoop-user@lucene.apache.org
>Subject: Re: Compression using Hadoop...
>
>ted, will the gzip files be a non-issue as far as splitting goes if
>they are under the default block size?
>
>C G, glad i could help a little.
>
>-jason
>
>On 8/31/07, C G wrote:
>> Thanks, Ted and Jason, for your comments.  Ted, your comments about gzip
>not being splittable were very timely...I'm watching my 8-node cluster
>saturate one node (with one gz file) and was wondering why.  Thanks for
>the "answer in advance" :-).
>>
>> Ted Dunning wrote:
>> With gzipped files, you do face the problem that your parallelism in
>> the map phase is pretty much limited to the number of files you have
>> (because gzip'ed files aren't splittable).  This is often not a problem
>> since most people can arrange to have dozens to hundreds of input files
>> more easily than they can arrange to have dozens to hundreds of CPU
>> cores working on their data.
>
>
