Please find the job file attached. I am using Gobblin 0.11.0.
Gobblin build info: I have tried this with Gobblin built against Hadoop versions
2.7.3.2.6.3.0-235 and 2.6.0-cdh5.9.0.
Thanks,
Sushant
From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:38 PM
To: Sushant Pandey <[email protected]>
Cc: [email protected]
Subject: Re: Compaction job output not compressed
This looks a bit weird. All you generally need is:
writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
and, optionally,
writer.codec.type=SNAPPY
which, if not specified, defaults to deflate.
I double-checked our compaction pipeline output through avro-tools and it looks
good as well. I am not sure why you are not seeing the compression. Can you share
the job file as well? I will try a clean deployment of compaction to see if I can
reproduce this.
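As a cross-check when avro-tools is not at hand, the avro.codec entry can be read directly from the file header. Below is a minimal sketch in Python, assuming the standard Avro object-container layout (4-byte magic, metadata map, 16-byte sync marker); a snappy-compressed file reports avro.codec = snappy, an uncompressed one reports null:

```python
MAGIC = b"Obj\x01"  # Avro object container file magic

def _read_long(f):
    """Decode one zigzag-varint-encoded Avro 'long' from a binary stream."""
    n, shift = 0, 0
    while True:
        b = f.read(1)[0]
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag encoding

def read_avro_metadata(path):
    """Return the file-metadata map (e.g. avro.codec, avro.schema)
    from the header of an Avro object container file."""
    meta = {}
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not an Avro object container file")
        while True:
            count = _read_long(f)  # map block: entry count
            if count == 0:
                break              # end of metadata map
            if count < 0:          # negative count: block byte size follows
                _read_long(f)
                count = -count
            for _ in range(count):
                key = f.read(_read_long(f)).decode("utf-8")
                meta[key] = f.read(_read_long(f))
    return meta
```

For a compressed output file, `read_avro_metadata("part-m-00000.avro").get("avro.codec")` (file name hypothetical) should return `b"snappy"`.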
What is the Gobblin version that you are using?
Regards,
Abhishek
On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey
<[email protected]> wrote:
Thanks Abhishek for the response.
I have a follow-up query. The Avro metadata in the input file has the avro.codec
property set to snappy, whereas the output file has it set to null; please see the
attachment (for the file size comparison and the avro.codec value). This behavior
persists even if the parameters compaction.input.deduplicated and
compaction.output.deduplicated are set to false.
Is this expected behavior?
Thanks,
Sushant
From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:14 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Compaction job output not compressed
Hi Sushant,
Just to set expectations: the compaction job removes duplicate records. It does
this by comparing the values of the 'key' fields; if no key fields are
specified, it compares the whole record. That said, do you have duplicate
records that are not getting removed? If so, what do the logs say?
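To illustrate the key-based comparison described above, here is a conceptual sketch in Python (this is only the idea, not Gobblin's actual MRCompactorAvroKeyDedupJobRunner implementation):

```python
def dedup(records, key_fields=None):
    """Remove duplicate records, keeping the first occurrence.

    If key_fields is given, two records are duplicates when those
    fields match; otherwise the whole record is compared.
    """
    seen = set()
    out = []
    for rec in records:
        if key_fields:
            key = tuple(rec[f] for f in key_fields)  # compare key fields only
        else:
            key = tuple(sorted(rec.items()))         # compare the whole record
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

For example, with `key_fields=["id"]`, two records that share an `id` but differ elsewhere are treated as duplicates; without key fields, they are both kept.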
Regards,
Abhishek
On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey
<[email protected]> wrote:
> Hi,
>
> I am executing a compaction job on snappy-compressed Avro files. Though
> the job executes successfully, the output is not compressed. Here is my
> configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
>
> I tried these options, without success:
>
>
> mapreduce.output.fileoutputformat.compress=true
>
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>
> mapreduce.output.fileoutputformat.compress.type=RECORD
>
> writer.output.format=AVRO
>
>
>
> writer.codec.type=SNAPPY
>
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
>
> Kindly let me know how to proceed on this. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
>
>
fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
writer.fs.uri=${fs.uri}
job.name=CompactKafkaMR
job.group=PNDA
mr.job.max.mappers=5
compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
compaction.input.dir=/user/pnda/PNDA_datasets/datasets
compaction.dest.dir=/user/pnda/PNDA_datasets/compacted4
compaction.input.subdir=.
compaction.dest.subdir=.
compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
compaction.timebased.max.time.ago=100000d
compaction.timebased.min.time.ago=1h
compaction.input.deduplicated=false
compaction.output.deduplicated=false
compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
compaction.timezone=UTC
compaction.job.overwrite.output.dir=true
compaction.recompact.from.input.for.late.data=true
writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.codec.type=SNAPPY