Ted Dunning wrote:
I have to say, btw, that the source tree structure of this project is pretty
ornate and not very parallel. I needed to add 10 source roots in IntelliJ to
get a clean compile. In this process, I noticed some circular dependencies.
Would the committers be open to some small
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?
C G, glad i could help a little.
-jason
On 8/31/07, C G [EMAIL PROTECTED] wrote:
Thanks Ted and Jason for your comments. Ted
Arun C Murthy wrote:
One way to reap benefits of both compression and better parallelism is to use
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
Of course this means you will have to do a conversion from .gzip to .seq file
and load it onto hdfs for your job,
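The block-compression idea behind SequenceFiles can be illustrated outside Java. The sketch below is not the SequenceFile API itself, just a minimal Python stand-in (the names `write_blocks` and `read_block` are hypothetical): records are grouped into blocks that are each compressed independently, so a reader that knows block offsets can hand different blocks to different workers.

```python
import gzip

def write_blocks(lines, block_size=4):
    """Illustrative stand-in for a block-compressed SequenceFile:
    each block of records is gzip-compressed independently, so a
    reader that knows the block offsets can decompress blocks in
    parallel instead of streaming one monolithic .gz file."""
    blocks = []
    for i in range(0, len(lines), block_size):
        payload = "".join(lines[i:i + block_size]).encode()
        blocks.append(gzip.compress(payload))
    return blocks

def read_block(blocks, idx):
    # A map task handed block idx can decode it without touching the others.
    return gzip.decompress(blocks[idx]).decode().splitlines()

lines = [f"record {n}\n" for n in range(10)]
blocks = write_blocks(lines)
```

This is why the conversion step pays off: the .gz original offers one decompression stream, while the block-compressed form offers one independent stream per block.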
-Original Message-
From: C G [mailto:[EMAIL PROTECTED]
Sent: Fri 8/31/2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...
Ted, from what you are saying I should be using at least 80 files given the
cluster size, and I should modify the loader to be aware
On 8/31/07 10:43 AM, Doug Cutting [EMAIL PROTECTED] wrote:
We really need someone to contribute an InputFormat for bzip files.
This has come up before: bzip is a standard compression format that is
splittable.
+1
- milind
--
Milind Bhandarkar
408-349-2136
([EMAIL PROTECTED])
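Why bzip2 lends itself to a splittable InputFormat can be shown with Python's bz2 module. A caveat: real bzip2 blocks inside a single stream are bit-aligned and harder to locate; this sketch uses concatenated whole streams as a simplified stand-in for the splittability property.

```python
import bz2

# bzip2 data is a sequence of independently compressed blocks, and a
# concatenation of streams is itself valid bzip2. That is what makes a
# splittable InputFormat feasible: a worker that knows a boundary can
# start decompressing there without reading the whole file.
part1 = bz2.compress(b"lines for split one\n")
part2 = bz2.compress(b"lines for split two\n")
whole = part1 + part2

# The whole file decompresses normally...
assert bz2.decompress(whole) == b"lines for split one\nlines for split two\n"
# ...and a task that seeks to the second boundary handles just its split.
assert bz2.decompress(whole[len(part1):]) == b"lines for split two\n"
```

Gzip offers no such internal boundaries, which is why one .gz file means one map task.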
From: Stu Hood [mailto:[EMAIL PROTECTED]
Sent: Friday, August 31, 2007 2:23 PM
To: hadoop-user@lucene.apache.org
Subject: RE: Re: Compression using Hadoop...
Isn't that what the distcp script does?
Thanks,
Stu
-Original Message-
From: Joydeep Sen Sarma
Sent: Friday, August 31, 2007 3:58pm
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
One thing I had done to speed up copy/put speeds was write a simple
On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote:
Arun C Murthy wrote:
One way to reap benefits of both compression and better parallelism is to
use compressed SequenceFiles:
http://wiki.apache.org/lucene-hadoop/SequenceFile
Of course this means you will have to do a conversion from
Hello All:
I think I must be missing something fundamental. Is it possible to load
compressed data into HDFS, and then operate on it directly with map/reduce? I
see a lot of stuff in the docs about writing compressed outputs, but nothing
about reading compressed inputs.
Am I being
If you put .gz files up on your HDFS cluster you don't need to do
anything to read them. I see lots of extra control via the API, but I
have simply put the files up and run my jobs on them.
-jason
On 8/30/07, C G [EMAIL PROTECTED] wrote:
Hello All:
I think I must be missing something
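The transparent handling jason describes can be mimicked in a few lines of Python. This is a rough analogue of what Hadoop's text input handling does for .gz inputs (the helper name `open_maybe_gzip` is invented for illustration): pick a decompressing reader based on the file suffix, so the job code just sees lines of text.

```python
import gzip
import os
import tempfile

def open_maybe_gzip(path):
    """Choose a reader by suffix, the way Hadoop picks a codec for
    .gz inputs, so callers read plain lines either way."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

# Write a small gzipped input file, then read it back transparently.
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, "input.txt.gz")
with gzip.open(gz_path, "wt") as f:
    f.write("alpha\nbeta\n")

with open_maybe_gzip(gz_path) as f:
    lines = f.read().splitlines()
```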
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzip'ed files aren't splittable). This is often not a problem since most
people can arrange to have dozens to hundreds of input files easier than
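Ted's point suggests a simple workaround: since each whole .gz file is the unit of parallelism, pre-split one large input into many separately gzipped files, one per intended map task. A minimal sketch, with `split_into_gzip_parts` as an invented helper name:

```python
import gzip
import os
import tempfile

def split_into_gzip_parts(lines, n_parts, out_dir):
    """gzip'ed files aren't splittable, so map-phase parallelism is
    capped at the file count; writing n_parts separately gzipped
    files restores up to n_parts-way parallelism."""
    paths = []
    for i in range(n_parts):
        path = os.path.join(out_dir, f"part-{i:05d}.gz")
        with gzip.open(path, "wt") as f:
            f.writelines(lines[i::n_parts])  # round-robin the records
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
lines = [f"record {n}\n" for n in range(100)]
paths = split_into_gzip_parts(lines, 4, out_dir)
```

Round-robin assignment keeps the parts roughly equal in size, so no single map task becomes a straggler.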