Please find the job file attached. I am using Gobblin 0.11.0.
Gobblin build info: I have tried this with Gobblin built against Hadoop versions
2.7.3.2.6.3.0-235 and 2.6.0-cdh5.9.0.
Thanks,
Sushant
From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:38 PM
To: Sushant Pandey <[email protected]>
Cc: [email protected]
Subject: Re: Compaction job output not compressed
This looks a bit weird. All you generally need is:
writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
and, optionally,
writer.codec.type=SNAPPY
which, if not specified, defaults to deflate.
I double-checked our compaction pipeline output through avro-tools and it looks
good as well. I am not sure why you are not seeing the compression. Can you share
the job file as well? I will try a clean deployment of compaction to see if I can
reproduce this.
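As a cross-check when avro-tools is not at hand, the avro.codec entry can be read directly from the file header. Below is a minimal sketch in Python, assuming the standard Avro object-container layout (4-byte magic, metadata map, 16-byte sync marker); a snappy-compressed file reports avro.codec = snappy, an uncompressed one reports null:

```python
MAGIC = b"Obj\x01"  # Avro object container file magic

def _read_long(f):
    """Decode one zigzag-varint-encoded Avro 'long' from a binary stream."""
    n, shift = 0, 0
    while True:
        b = f.read(1)[0]
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag encoding

def read_avro_metadata(path):
    """Return the file-metadata map (e.g. avro.codec, avro.schema)
    from the header of an Avro object container file."""
    meta = {}
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not an Avro object container file")
        while True:
            count = _read_long(f)  # map block: entry count
            if count == 0:
                break              # end of metadata map
            if count < 0:          # negative count: block byte size follows
                _read_long(f)
                count = -count
            for _ in range(count):
                key = f.read(_read_long(f)).decode("utf-8")
                meta[key] = f.read(_read_long(f))
    return meta
```

For a compressed output file, `read_avro_metadata("part-m-00000.avro").get("avro.codec")` (file name hypothetical) should return `b"snappy"`.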
What is the Gobblin version that you are using?
Regards,
Abhishek
On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey
<[email protected]> wrote:
Thanks Abhishek for the response.
I have a follow-up query. The Avro metadata in the input file has the avro.codec
property set to snappy, whereas the output file has it set to null; please see the
attachment (for the file size comparison and the avro.codec value). This behavior
persists even if the parameters compaction.input.deduplicated and
compaction.output.deduplicated are set to false.
Is this expected behavior?
Thanks,
Sushant
From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:14 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Compaction job output not compressed
Hi Sushant,
Just to set expectations: the compaction job removes duplicate records. It does
this by comparing the values of the 'key' fields; if no key fields are
specified, it compares the whole record. That said, do you have duplicate
records that are not getting removed? If so, what do the logs say?
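To illustrate the key-based comparison described above, here is a conceptual sketch in Python (this is only the idea, not Gobblin's actual MRCompactorAvroKeyDedupJobRunner implementation):

```python
def dedup(records, key_fields=None):
    """Remove duplicate records, keeping the first occurrence.

    If key_fields is given, two records are duplicates when those
    fields match; otherwise the whole record is compared.
    """
    seen = set()
    out = []
    for rec in records:
        if key_fields:
            key = tuple(rec[f] for f in key_fields)  # compare key fields only
        else:
            key = tuple(sorted(rec.items()))         # compare the whole record
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

For example, with `key_fields=["id"]`, two records that share an `id` but differ elsewhere are treated as duplicates; without key fields, they are both kept.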
Regards,
Abhishek
On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey
<[email protected]> wrote:
> Hi,
>
> I am executing a compaction job on snappy-compressed Avro files. Though
> the job executes successfully, the output is not compressed. Here is my
> configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
>
> I tried these options, without success:
>
>
> mapreduce.output.fileoutputformat.compress=true
>
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>
> mapreduce.output.fileoutputformat.compress.type=RECORD
>
> writer.output.format=AVRO
>
>
>
> writer.codec.type=SNAPPY
>
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
>
> Kindly let me know how to proceed on this. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
>
>
fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
writer.fs.uri=${fs.uri}
job.name=CompactKafkaMR
job.group=PNDA
mr.job.max.mappers=5
compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
compaction.input.dir=/user/pnda/PNDA_datasets/datasets
compaction.dest.dir=/user/pnda/PNDA_datasets/compacted4
compaction.input.subdir=.
compaction.dest.subdir=.
compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
compaction.timebased.max.time.ago=100000d
compaction.timebased.min.time.ago=1h
compaction.input.deduplicated=false
compaction.output.deduplicated=false
compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
compaction.timezone=UTC
compaction.job.overwrite.output.dir=true
compaction.recompact.from.input.for.late.data=true
writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.codec.type=SNAPPY