On Fri, Aug 31, 2007 at 03:25:03PM -0700, Joydeep Sen Sarma wrote:
>Ah - very similar .. wish I had known :-)
>
>Support for globs and logic to split/parallelize large text files would
>help ..
>
Please file these requests... and patches if you can! :)

Arun

>-----Original Message-----
>From: Stu Hood [mailto:[EMAIL PROTECTED]
>Sent: Friday, August 31, 2007 2:23 PM
>To: hadoop-user@lucene.apache.org
>Subject: RE: Re: Compression using Hadoop...
>
>Isn't that what the distcp script does?
>
>Thanks,
>Stu
>
>
>-----Original Message-----
>From: Joydeep Sen Sarma
>Sent: Friday, August 31, 2007 3:58pm
>To: hadoop-user@lucene.apache.org
>Subject: Re: Compression using Hadoop...
>
>One thing I did to speed up copy/put speeds was to write a simple
>map-reduce job that does parallel copies of files from an input directory
>(in our case the input directory is NFS-mounted on all task nodes). It
>gives us a huge speed bump.
>
>It's trivial to roll your own - but I'd be happy to share as well.
>
>
>-----Original Message-----
>From: C G [mailto:[EMAIL PROTECTED]
>Sent: Friday, August 31, 2007 11:21 AM
>To: hadoop-user@lucene.apache.org
>Subject: RE: Compression using Hadoop...
>
>My input is typical row-based stuff over which I run a large stack
>of aggregations/rollups. After reading earlier posts on this thread, I
>modified my loader to split the input up into 1M-row partitions
>(literally gunzip -cd input.gz | split...). I then ran an experiment
>using 50M rows (i.e. 50 gz files loaded into HDFS) on an 8-node cluster.
>Ted, from what you are saying I should be using at least 80 files given
>the cluster size, and I should modify the loader to be aware of the
>number of nodes and split accordingly. Do you concur?
>
>Load time to HDFS may be the next challenge. My HDFS configuration on
>8 nodes uses a replication factor of 3. Sequentially copying my data to
>HDFS using -copyFromLocal took 23 minutes to move 266 MB in individual
>files of 5.7 MB each. Does anybody find this result surprising? Note
>that this is on EC2, where there is no such thing as rack-level or
>switch-level locality. Should I expect dramatically better performance
>on real iron? Once I get this prototyping/education under my belt, my
>plan is to deploy a 64-node grid of 4-way machines with a terabyte of
>local storage on each node.
>
>Thanks for the discussion...the Hadoop community is very helpful!
>
>C G
>
>
>Ted Dunning wrote:
>
>They will only be a non-issue if you have enough of them to get the
>parallelism you want. If the number of gzip files is > 10 * the number
>of task nodes, you should be fine.
>
>
>-----Original Message-----
>From: [EMAIL PROTECTED] on behalf of jason gessner
>Sent: Fri 8/31/2007 9:38 AM
>To: hadoop-user@lucene.apache.org
>Subject: Re: Compression using Hadoop...
>
>Ted, will the gzip files be a non-issue as far as splitting goes if
>they are under the default block size?
>
>C G, glad I could help a little.
>
>-jason
>
>On 8/31/07, C G wrote:
>> Thanks Ted and Jason for your comments. Ted, your comments about gzip
>> not being splittable were very timely...I'm watching my 8-node cluster
>> saturate one node (with one gz file) and was wondering why. Thanks for
>> the "answer in advance" :-).
>>
>> Ted Dunning wrote:
>> With gzipped files, you do face the problem that your parallelism in the
>> map phase is pretty much limited to the number of files you have (because
>> gzip'ed files aren't splittable). This is often not a problem, since most
>> people can arrange to have dozens to hundreds of input files more easily
>> than they can arrange to have dozens to hundreds of CPU cores working on
>> their data.
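For illustration, the parallel-copy trick Joydeep describes boils down to a map-only job whose input is a small text file listing one source path per line; each map task copies its path into HDFS, so the copies run in parallel across the cluster. The sketch below is not his code - it is a minimal reconstruction against the old org.apache.hadoop.mapred API, and the class names (ParallelCopy, CopyMapper), the /user/data target directory, and the use of NLineInputFormat (one path per map task) are all assumptions made for the example.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// Hypothetical parallel-copy job: the input is a text file with one source
// path per line; paths must be visible on every task node (e.g. NFS-mounted).
public class ParallelCopy {

  public static class CopyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf conf;

    public void configure(JobConf conf) {
      this.conf = conf;
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Path src = new Path(value.toString().trim());      // local/NFS source file
      Path dst = new Path("/user/data", src.getName());  // hypothetical HDFS target dir
      FileSystem fs = FileSystem.get(conf);
      fs.copyFromLocalFile(false, true, src, dst);       // keep source, overwrite target
      reporter.setStatus("copied " + src);
      out.collect(value, new Text(dst.toString()));      // record what was copied where
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ParallelCopy.class);
    job.setJobName("parallel-copy");

    // One source path per map task, so the copies spread across the cluster.
    job.setInputFormat(NLineInputFormat.class);
    job.setInt("mapred.line.input.format.linespermap", 1);

    FileInputFormat.setInputPaths(job, new Path(args[0]));   // file listing source paths
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory for the copy log

    job.setMapperClass(CopyMapper.class);
    job.setNumReduceTasks(0);                                // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    JobClient.runJob(job);
  }
}

It would be run as something like "hadoop jar parallelcopy.jar ParallelCopy /lists/files.txt /tmp/copy-log" (paths hypothetical); the replication to 3 copies still happens on write, but the per-file copies no longer proceed one at a time the way a sequential -copyFromLocal does.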