Re: Compression using Hadoop...

2007-09-04 Thread Doug Cutting
Ted Dunning wrote: I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel. I needed to add 10 source roots in IntelliJ to get a clean compile. In this process, I noticed some circular dependencies. Would the committers be open to some small

Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
[EMAIL PROTECTED] on behalf of jason gessner Sent: Fri 8/31/2007 9:38 AM To: hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size? C G, glad i could help a little. -jason On 8

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size? C G, glad i could help a little. -jason On 8/31/07, C G [EMAIL PROTECTED] wrote: Thanks Ted and Jason for your comments. Ted

Re: Compression using Hadoop...

2007-08-31 Thread Doug Cutting
Arun C Murthy wrote: One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile Of course this means you will have to do a conversion from .gzip to .seq file and load it onto hdfs for your job,
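
The snippet is cut off, but the conversion it refers to can be done as a small standalone step before the job. A minimal sketch, assuming the standard SequenceFile.createWriter API with block compression; the paths, the LongWritable/Text record types, and the DefaultCodec choice are placeholders rather than anything specified in the thread:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class GzipToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    DefaultCodec codec = new DefaultCodec();
    codec.setConf(conf);

    // BLOCK compression groups many records per compressed block, so the
    // resulting .seq file stays splittable and maps can run in parallel.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, codec);

    // Read the gzipped input line by line and append each line as a record.
    BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(args[0]))));
    String line;
    long lineNo = 0;
    while ((line = in.readLine()) != null) {
      writer.append(new LongWritable(lineNo++), new Text(line));
    }
    in.close();
    writer.close();
  }
}

If fs.default.name points at the cluster, the writer streams the .seq file straight into HDFS; otherwise write it locally and `hadoop fs -put` it afterwards.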

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
-Original Message- From: C G [mailto:[EMAIL PROTECTED] Sent: Fri 8/31/2007 11:21 AM To: hadoop-user@lucene.apache.org Subject: RE: Compression using Hadoop... Ted, from what you are saying I should be using at least 80 files given the cluster size, and I should modify the loader to be aware

Re: Compression using Hadoop...

2007-08-31 Thread Milind Bhandarkar
On 8/31/07 10:43 AM, Doug Cutting [EMAIL PROTECTED] wrote: We really need someone to contribute an InputFormat for bzip files. This has come up before: bzip is a standard compression format that is splittable. +1 - milind -- Milind Bhandarkar 408-349-2136 ([EMAIL PROTECTED])

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
Subject: Re: Compression using Hadoop... On 8/31/07 10:43 AM, Doug Cutting [EMAIL PROTECTED] wrote: We really need someone to contribute an InputFormat for bzip files. This has come up before: bzip is a standard compression format that is splittable. +1 - milind -- Milind Bhandarkar 408-349

RE: Re: Compression using Hadoop...

2007-08-31 Thread Joydeep Sen Sarma
using Hadoop... Isn't that what the distcp script does? Thanks, Stu -Original Message- From: Joydeep Sen Sarma Sent: Friday, August 31, 2007 3:58pm To: hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... One thing I had done to speed up copy/put speeds was write a simple
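
Joydeep's message is truncated, so the following is only a guess at the sort of helper being described, not a reconstruction of it: a minimal sketch that copies a local directory of files into HDFS from several client threads at once instead of serially. The destination path and thread count are made up.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path dest = new Path("/data/incoming");            // hypothetical target dir
    ExecutorService pool = Executors.newFixedThreadPool(8);  // hypothetical width

    // One upload per task; the client-side copies proceed concurrently.
    for (final File f : new File(args[0]).listFiles()) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dest);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}

distcp, which Stu mentions, takes the other approach and runs the copies as map tasks inside the cluster rather than from a single client machine.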

Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote: Arun C Murthy wrote: One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile Of course this means you will have to do a conversion from

Re: Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
From: Stu Hood [mailto:[EMAIL PROTECTED] Sent: Friday, August 31, 2007 2:23 PM To: hadoop-user@lucene.apache.org Subject: RE: Re: Compression using Hadoop... Isn't that what the distcp script does? Thanks, Stu -Original Message- From: Joydeep Sen Sarma Sent: Friday, August 31, 2007 3:58pm

Compression using Hadoop...

2007-08-30 Thread C G
Hello All: I think I must be missing something fundamental. Is it possible to load compressed data into HDFS, and then operate on it directly with map/reduce? I see a lot of stuff in the docs about writing compressed outputs, but nothing about reading compressed inputs. Am I being

Re: Compression using Hadoop...

2007-08-30 Thread jason gessner
if you put .gz files up on your HDFS cluster you don't need to do anything to read them. I see lots of extra control via the API, but i have simply put the files up and run my jobs on them. -jason On 8/30/07, C G [EMAIL PROTECTED] wrote: Hello All: I think I must be missing something
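
A minimal sketch of what jason describes, using the old mapred API: point a job at a directory of .gz files and each file is decompressed transparently, because the record reader picks a compression codec from the file suffix. The identity mapper/reducer and the paths are placeholders, not anything from the thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ReadGzipInput {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReadGzipInput.class);
    conf.setJobName("read-gzip-input");

    // No compression-specific configuration: the line reader sees the .gz
    // suffix and wraps the input stream in the matching decompressor.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    FileInputFormat.setInputPaths(conf, new Path("/user/cg/input"));   // dir of .gz files
    FileOutputFormat.setOutputPath(conf, new Path("/user/cg/output"));

    JobClient.runJob(conf);
  }
}

The caveat in Ted's reply below still applies: each .gz file arrives at a single mapper, whole.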

Re: Compression using Hadoop...

2007-08-30 Thread Ted Dunning
With gzipped files, you do face the problem that your parallelism in the map phase is pretty much limited to the number of files you have (because gzip'ed files aren't splittable). This is often not a problem since most people can arrange to have dozens to hundreds of input files easier than
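
Since an unsplittable .gz file goes to exactly one mapper, the practical workaround Ted points at is to carve large inputs into many smaller gzip files before loading them. A minimal local sketch, with a hypothetical part-naming scheme and a lines-per-part threshold you would tune so the part count is at least the map parallelism you want:

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.GZIPOutputStream;

public class SplitToGzipParts {
  static final long LINES_PER_PART = 1000000;  // tune to your cluster's map slots

  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    Writer out = null;
    int part = 0;
    long count = 0;
    String line;
    while ((line = in.readLine()) != null) {
      if (out == null) {
        // Each part-NNNNN.gz becomes one map task once loaded into HDFS.
        out = new OutputStreamWriter(new GZIPOutputStream(
            new FileOutputStream(String.format("part-%05d.gz", part++))), "UTF-8");
      }
      out.write(line);
      out.write('\n');
      if (++count % LINES_PER_PART == 0) {
        out.close();
        out = null;
      }
    }
    if (out != null) {
      out.close();
    }
    in.close();
  }
}

Loading the parts with `hadoop fs -put` then gives the job one map task per part, which is usually enough parallelism even though no single part can be split.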