Hi Sushant,

Just to set expectations: the compaction job removes duplicate records by
comparing the values of the 'key' fields. If no key fields are specified, it
compares entire records. That said, do you have duplicate records that are
not being removed? If so, what do the logs say?
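In rough terms, the dedup behavior described above looks like this (a
hypothetical Python sketch for illustration only, not Gobblin's actual
MRCompactorAvroKeyDedupJobRunner code):

```python
# Hypothetical sketch of key-based deduplication, mirroring the behavior
# described above. Not Gobblin's real implementation.
def dedup(records, key_fields=None):
    seen = set()
    result = []
    for rec in records:
        if key_fields:
            # Compare only the configured key fields.
            key = tuple(rec[f] for f in key_fields)
        else:
            # No key fields configured: compare the whole record.
            key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

records = [
    {"id": 1, "val": "a"},
    {"id": 1, "val": "b"},  # duplicate by key "id"
    {"id": 2, "val": "a"},
]
print(dedup(records, key_fields=["id"]))  # only two records survive
```

With key_fields set, the second record is dropped even though its "val"
differs; without key_fields, all three records would be kept.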

Regards,
Abhishek

On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <
[email protected]> wrote:

> Hi,
>
> I am executing a compaction job on Snappy-compressed Avro files. Although
> the job executes successfully, the output is not compressed. The following
> is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
>
> I tried these options without success:
>
>
> mapreduce.output.fileoutputformat.compress=true
>
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>
> mapreduce.output.fileoutputformat.compress.type=RECORD
>
> writer.output.format=AVRO
>
>
>
> writer.codec.type=SNAPPY
>
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
>
> Kindly let me know how to proceed. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
>
>
