Maybe I'm missing the point, but in terms of execution performance, what benefit does copying to DFS and then compressing via a map/reduce job provide? Isn't it better to compress "offline", outside the latency window, and make the compressed data available on DFS? Also, your MapReduce program will launch one map task per compressed file (gzip is not splittable), so make sure you design your compression accordingly.
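The "one map task per compressed file" point above is worth making concrete: since gzip is not splittable, a single large .gz file caps the job at one mapper. A minimal sketch of designing the compression accordingly, using standard split/gzip (file names here are hypothetical stand-ins):

```shell
# Create a stand-in for the real data set (hypothetical file name).
seq 1 100000 > bigfile.txt

# Split into fixed-size chunks BEFORE compressing; each resulting
# .gz file can then be handled by its own map task.
split -l 25000 bigfile.txt chunk_

# Compress each chunk independently. One big .gz would have forced
# a single mapper; four .gz files allow up to four mappers.
gzip chunk_*
ls chunk_*.gz
```

The chunk size would normally be chosen near the HDFS block size so each mapper gets roughly one block's worth of work.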
Thanks,
Amogh

-----Original Message-----
From: Sugandha Naolekar [mailto:[email protected]]
Sent: Monday, August 03, 2009 12:32 PM
To: [email protected]
Subject: Re: :!

That's fine. But if I place the data in HDFS and then run map/reduce code
to compress it, the data will get compressed into sequence files, but the
original (uncompressed) data will still reside there as well, causing a
kind of redundancy. Can you please suggest a way out?

On Mon, Aug 3, 2009 at 12:07 PM, prashant ullegaddi <
[email protected]> wrote:

> I don't think you will be able to compress the data with MapReduce unless
> it's on HDFS. What you can do is:
>
> 1. Manually compress the data on the machine where the data resides.
>    Then copy it to HDFS. Or,
> 2. Copy the data to HDFS without compressing it, then run a job which
>    just emits the data as key/value pairs as it reads it. You can set
>    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class) so
>    that the output gets gzipped.
>
> Does that solve your problem?
>
> By the way, you didn't exactly specify your data size (how many TBs).
>
> On Mon, Aug 3, 2009 at 11:02 AM, Sugandha Naolekar
> <[email protected]> wrote:
>
> > Yes, you are right. Here are the related details:
> >
> > -> I have a Hadoop cluster of 7 nodes. There is also an 8th machine,
> >    which is not a part of the Hadoop cluster.
> > -> I want to place the data of that machine into HDFS. Before placing
> >    it in HDFS, I want to compress it, and then dump it into HDFS.
> > -> I have 4 datanodes in my cluster. Also, the data might grow to
> >    terabytes.
> > -> Also, I have set the replication factor to 2.
> > -> I guess, for compression, I will have to run map/reduce? If so,
> >    please tell me the complete approach that needs to be followed.
> >
> > On Mon, Aug 3, 2009 at 10:48 AM, prashant ullegaddi <
> > [email protected]> wrote:
> >
> > > By "I want to compress the data first and then place it in HDFS", do
> > > you mean you want to compress the data locally and then copy it to
> > > DFS?
> > >
> > > What's the size of your data? What's the capacity of HDFS?
> > >
> > > On Mon, Aug 3, 2009 at 10:45 AM, Sugandha Naolekar
> > > <[email protected]> wrote:
> > >
> > > > I want to compress the data first and then place it in HDFS. Again,
> > > > while retrieving it, I want to uncompress it and place it at the
> > > > desired destination. Is this possible? How do I get started? Also,
> > > > I want to get started with the actual coding part of compression
> > > > and MapReduce. Please advise!
> > > >
> > > > --
> > > > Regards!
> > > > Sugandha
> >
> > --
> > Regards!
> > Sugandha

--
Regards!
Sugandha
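Prashant's option 1 above (compress locally, then copy to HDFS) can be sketched with standard tools. The file name and HDFS paths below are hypothetical, and the hadoop commands are shown as comments since they need a client configured against the cluster:

```shell
# Stand-in for the data sitting on the 8th (non-cluster) machine.
printf 'some data to archive\n' > export.txt

# Compress locally first; -c writes to stdout so the original is kept.
gzip -c export.txt > export.txt.gz

# Then copy only the compressed file into HDFS (requires a Hadoop
# client; the destination path is a hypothetical example):
#   hadoop fs -put export.txt.gz /user/sugandha/input/
#
# Retrieval is the reverse: fetch, then uncompress at the destination:
#   hadoop fs -get /user/sugandha/input/export.txt.gz .
#   gunzip export.txt.gz
```

This avoids the redundancy Sugandha describes, since the uncompressed copy never lands in HDFS at all; only the local original needs to be deleted once the upload is verified.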
