Hi Sushant,

Nice to know that you were able to get it to work. I agree that the type params are probably not being propagated, so it would need a small fix. Please feel free to submit a PR if you know what needs to be changed, or else please create a Jira for it.
Regards,
Abhishek

On Thu, Feb 1, 2018 at 1:24 AM, Sushant Pandey <[email protected]> wrote:

> Hi Abhishek,
>
> As discussed, I tried the following parameters to achieve compression:
>
> mapred.output.compress=true
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> mapred.output.compression.type=BLOCK
>
> I was able to achieve compression, but with the Deflate codec even though
> I specified Snappy. I suspect the codec and type parameters are not
> propagated to the Hadoop libs and thus it defaults to Deflate, which is
> set as the default codec in my setup.
>
> Is there a workaround available to beat this behavior?
>
> Thanks
> Sushant
>
> *From:* Abhishek Tiwari <[email protected]>
> *Sent:* Thursday, January 25, 2018 4:38 PM
> *To:* Sushant Pandey <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Compaction job output not compressed
>
> This looks a bit weird; all you generally need is:
>
> writer.output.format=AVRO
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> and optionally,
>
> writer.codec.type=SNAPPY
>
> .. which, if not specified, defaults to deflate.
>
> I double-checked our compaction pipeline output through avro-tools and it
> looks good as well. Not sure why you are not seeing the compression. Can
> you share the job file as well, and I will try to run a clean deployment
> of compaction to see if I can reproduce this.
>
> What is the Gobblin version that you are using?
>
> Regards,
> Abhishek
>
> On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey <[email protected]> wrote:
>
> Thanks Abhishek for the response.
>
> I have a follow-up query. The Avro meta information in the input file has
> the *avro.codec* property's value set to *snappy*, whereas the output
> file has this set to *null*; please see the attachment (for file size
> comparison and avro.codec value).
> This behavior persists even if the parameters
> *compaction.input.deduplicated* and *compaction.output.deduplicated* are
> set to *false*.
>
> Is this an expected behavior?
>
> Thanks
> Sushant
>
> *From:* Abhishek Tiwari <[email protected]>
> *Sent:* Thursday, January 25, 2018 4:14 AM
> *To:* [email protected]
> *Subject:* Re: Compaction job output not compressed
>
> Hi Sushant,
>
> Just to set the expectations: the compaction job removes duplicate
> records. The way it does this is by comparing the 'key' field's values.
> If the key fields are not specified, it compares the whole record.
> Having said that, do you have duplicate records that are not getting
> removed? If so, what do the logs say?
>
> Regards,
> Abhishek
>
> On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <[email protected]> wrote:
>
> Hi,
>
> I am executing a compaction job on snappy-compressed Avro files. Though
> the job is executing successfully, the output is not compressed.
> Following is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
> I tried these options with no success:
>
> mapreduce.output.fileoutputformat.compress=true
> mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
> mapreduce.output.fileoutputformat.compress.type=RECORD
> writer.output.format=AVRO
> writer.codec.type=SNAPPY
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> Kindly let me know how to proceed on this. Am I missing some
> configuration parameters?
>
> Thanks
> Sushant Pandey
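For anyone following this thread: a quick way to confirm which codec actually landed in the compacted output, without avro-tools on the path, is to read the avro.codec entry straight out of the Avro object container file header (magic "Obj\x01", then a map<string, bytes> of metadata). This is a minimal stdlib-Python sketch, not part of the thread; the file path in the usage comment is a placeholder.

```python
AVRO_MAGIC = b"Obj\x01"

def _read_long(buf: bytes, pos: int):
    """Decode one Avro zigzag-varint long; return (value, new_pos)."""
    shift, acc = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos

def read_avro_codec(header: bytes) -> str:
    """Return the avro.codec value from an Avro container-file header.

    Per the Avro spec, the header is: magic 'Obj\\x01', a map<string, bytes>
    of file metadata, then a 16-byte sync marker. A missing avro.codec
    entry means the file is uncompressed ('null').
    """
    if header[:4] != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    pos, meta = 4, {}
    count, pos = _read_long(header, pos)
    while count != 0:
        if count < 0:  # negative count: a block byte-size follows
            _, pos = _read_long(header, pos)
            count = -count
        for _ in range(count):
            klen, pos = _read_long(header, pos)
            key = header[pos:pos + klen].decode("utf-8")
            pos += klen
            vlen, pos = _read_long(header, pos)
            meta[key] = header[pos:pos + vlen]
            pos += vlen
        count, pos = _read_long(header, pos)
    return meta.get("avro.codec", b"null").decode("utf-8")

# Usage (path is a placeholder): the header sits in the first few KB.
# with open("/tmp/part-m-00000.avro", "rb") as f:
#     print(read_avro_codec(f.read(65536)))
```

Running this against both the input and the compacted output should show "snappy" versus "null", matching what Sushant saw in the attachment.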
