Hi Abhishek,

As discussed, I tried the following parameters to achieve compression –

mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type=BLOCK

I was able to achieve compression, but with the Deflate codec even though I 
specified Snappy. I suspect the codec and type parameters are not propagated to 
the Hadoop libs, so the job falls back to Deflate, which is the default codec 
in my setup.

Is there a workaround for this behavior?

Thanks
sushant

From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:38 PM
To: Sushant Pandey <[email protected]>
Cc: [email protected]
Subject: Re: Compaction job output not compressed

This looks a bit weird; all you generally need is:

writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder

and optionally,
writer.codec.type=SNAPPY

.. which, if not specified, defaults to Deflate.

I double-checked our compaction pipeline output through avro-tools and it looks 
good as well. Not sure why you are not seeing the compression. Can you share 
the job file as well? I will try to run a clean deployment of compaction to 
see if I can reproduce this.

What is the Gobblin version that you are using?

Regards,
Abhishek

On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey 
<[email protected]> wrote:
Thanks for the response, Abhishek.

I have a follow-up query. The Avro meta information in the input file has the 
avro.codec property's value set to snappy, whereas the output file has it set 
to null; please see the attachment (for file size comparison and the 
avro.codec value). This behavior persists even if the parameters 
compaction.input.deduplicated and compaction.output.deduplicated are set to false.

Is this expected behavior?

Thanks
sushant



From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:14 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Compaction job output not compressed

Hi Sushant,

Just to set expectations: the compaction job removes duplicate records. It
does this by comparing the values of the 'key' fields; if no key fields are
specified, it compares the whole record. Having said that, do you have
duplicate records that are not getting removed? If so, what do the logs say?
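The dedup rule described above (compare the configured key fields, else the whole record) can be sketched as follows; the record and field names are made up for illustration, and this is not Gobblin's actual implementation:

```python
def dedup(records, key_fields=None):
    """Keep the first occurrence of each key. When no key fields are
    configured, the entire record acts as the key, mirroring the
    compaction behavior described above (a sketch, not Gobblin code)."""
    seen, out = set(), []
    for rec in records:
        if key_fields:
            key = tuple(rec[f] for f in key_fields)
        else:
            key = tuple(sorted(rec.items()))  # whole record as the key
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

So two records that differ only outside the key fields still count as duplicates when key fields are set, but not when they are omitted.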

Regards,
Abhishek

On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey 
<[email protected]> wrote:

> Hi,
>
> I am executing a compaction job on Snappy-compressed Avro files. Though
> the job executes successfully, the output is not compressed. Following
> is my configuration for the compaction job –
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
>
> I tried these options without success –
>
>
> mapreduce.output.fileoutputformat.compress=true
>
> mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
>
> mapreduce.output.fileoutputformat.compress.type=RECORD
>
> writer.output.format=AVRO
>
>
>
> writer.codec.type=SNAPPY
>
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
>
> Kindly let me know how to proceed on this. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
>
>


