Hi Sushant,

Just to set expectations: the compaction job removes duplicate records. It does this by comparing the values of the 'key' fields; if no key fields are specified, it compares the whole record. Having said that, do you have duplicate records that are not getting removed? If so, what do the logs say?
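Conceptually, the dedup step behaves like this (a minimal Python sketch to illustrate the idea, not Gobblin's actual implementation; the record layout and field names are made up):

```python
def dedup(records, key_fields=None):
    """Remove duplicate records, keeping the first occurrence.

    If key_fields is given, two records count as duplicates when their
    values for those fields match; otherwise the whole record is compared.
    """
    seen = set()
    out = []
    for rec in records:
        if key_fields:
            # Compare only the configured key fields.
            key = tuple(rec[f] for f in key_fields)
        else:
            # No key fields configured: compare the entire record.
            key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

records = [
    {"id": 1, "val": "a"},
    {"id": 1, "val": "b"},   # same key, different payload
    {"id": 1, "val": "a"},   # exact duplicate
]

# Keyed dedup keeps only the first record for id=1.
print(dedup(records, key_fields=["id"]))
# Whole-record dedup drops only the exact duplicate.
print(dedup(records))
```

Note the difference: with a key field, the second record is treated as a duplicate even though its payload differs; without one, it survives.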
Regards,
Abhishek

On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <[email protected]> wrote:

> Hi,
>
> I am executing a compaction job on snappy-compressed Avro files. Though
> the job executes successfully, the output is not compressed. Following
> is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
> I tried these options, to no success:
>
> mapreduce.output.fileoutputformat.compress=true
> mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
> mapreduce.output.fileoutputformat.compress.type=RECORD
> writer.output.format=AVRO
> writer.codec.type=SNAPPY
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> Kindly let me know how to proceed on this. Am I missing some configuration
> parameters?
>
> Thanks,
> Sushant Pandey
