Thanks, Abhishek, for the response.
I have a follow-up query. In the Avro metadata of the input files, the
avro.codec property is set to snappy, whereas in the output file it is set
to null; please see the attached output (file size comparison and avro.codec
values). This behavior persists even if the parameters
compaction.input.deduplicated and compaction.output.deduplicated are set to
false.
Is this expected behavior?
Thanks
sushant
From: Abhishek Tiwari<mailto:[email protected]>
Sent: Thursday, January 25, 2018 4:14 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Compaction job output not compressed
Hi Sushant,
Just to set expectations: the compaction job removes duplicate records. It
does so by comparing the values of the 'key' fields. If no key fields are
specified, it compares whole records. Having said that, do you have
duplicate records that are not getting removed? If so, what do the logs
say?
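
To make the comparison described above concrete, here is a minimal Python
sketch. It is an illustration only, not Gobblin's actual implementation: the
function name dedup_records, the choice of key fields, and the sample values
are all made up, and only the field names come from the schema in this
thread.

```python
def dedup_records(records, key_fields=None):
    """Return records with duplicates removed, keeping the first occurrence.

    Hypothetical helper: compares only key_fields when given, otherwise
    compares the whole (flat) record.
    """
    seen = set()
    unique = []
    for record in records:
        if key_fields:
            # Identity is defined by the values of the key fields only.
            identity = tuple(record[name] for name in key_fields)
        else:
            # No key fields: every field of the record takes part.
            identity = tuple(sorted(record.items()))
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique


# Made-up sample events using the field names of the pnda.entity.event schema.
events = [
    {"timestamp": 1, "src": "collectd", "host_ip": "10.0.0.1", "rawdata": b"a"},
    {"timestamp": 1, "src": "collectd", "host_ip": "10.0.0.1", "rawdata": b"a"},
    {"timestamp": 1, "src": "collectd", "host_ip": "10.0.0.2", "rawdata": b"b"},
]

whole = dedup_records(events)                                   # 2 records remain
by_key = dedup_records(events, key_fields=["timestamp", "src"])  # 1 record remains
```

With whole-record comparison only exact copies are dropped; with key fields,
records that differ outside the key are also treated as duplicates.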
Regards,
Abhishek
On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <
[email protected]> wrote:
> Hi,
>
> I am executing a compaction job on Snappy-compressed Avro files. Although
> the job executes successfully, the output is not compressed. The following
> is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
>
> I tried these options without success:
>
> mapreduce.output.fileoutputformat.compress=true
> mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
> mapreduce.output.fileoutputformat.compress.type=RECORD
> writer.output.format=AVRO
> writer.codec.type=SNAPPY
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> Kindly let me know how to proceed. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
>
>
//Size of input file//
$ sudo -su hdfs hdfs dfs -ls /user/pnda/PNDA_datasets/datasets/source=collectd/year=1974/month=10/day=22/hour=12
Found 2 items
-rw-r--r-- 1 pnda pnda 2607 2018-01-25 07:31 /user/pnda/PNDA_datasets/datasets/source=collectd/year=1974/month=10/day=22/hour=12/5bd0e3f5-1931-4025-8b37-76e6f83f9e21.avro
-rw-r--r-- 1 pnda pnda 2304 2018-01-25 07:31 /user/pnda/PNDA_datasets/datasets/source=collectd/year=1974/month=10/day=22/hour=12/9806c063-67dc-4984-9527-79a3c9753f12.avro
//Avro meta of input files//
$ avro-tools getmeta 5bd0e3f5-1931-4025-8b37-76e6f83f9e21.avro
avro.schema
{"type":"record","name":"event","namespace":"pnda.entity","fields":[{"name":"timestamp","type":"long"},{"name":"src","type":"string"},{"name":"host_ip","type":"string"},{"name":"rawdata","type":"bytes"}]}
avro.codec snappy
$ avro-tools getmeta 9806c063-67dc-4984-9527-79a3c9753f12.avro
avro.schema
{"type":"record","name":"event","namespace":"pnda.entity","fields":[{"name":"timestamp","type":"long"},{"name":"src","type":"string"},{"name":"host_ip","type":"string"},{"name":"rawdata","type":"bytes"}]}
avro.codec snappy
//Size of output file//
$ sudo -su hdfs hdfs dfs -ls /user/pnda/PNDA_datasets/compacted/source=collectd/year=1974/month=10/day=22/hour=12
Found 3 items
-rw-r--r-- 1 pnda supergroup 8 2018-01-25 07:42 /user/pnda/PNDA_datasets/compacted/source=collectd/year=1974/month=10/day=22/hour=12/_COMPACTION_COMPLETE
-rw-r--r-- 1 pnda supergroup 0 2018-01-25 07:42 /user/pnda/PNDA_datasets/compacted/source=collectd/year=1974/month=10/day=22/hour=12/_SUCCESS
-rw-r--r-- 1 pnda supergroup 28479 2018-01-25 07:42 /user/pnda/PNDA_datasets/compacted/source=collectd/year=1974/month=10/day=22/hour=12/part-m-199.1516866138487.523930798.avro
//Avro meta of output file//
$ avro-tools getmeta part-m-199.1516865921375.1782794325.avro
avro.schema
{"type":"record","name":"event","namespace":"pnda.entity","fields":[{"name":"timestamp","type":"long"},{"name":"src","type":"string"},{"name":"host_ip","type":"string"},{"name":"rawdata","type":"bytes"}]}
avro.codec null
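
//Scripted check of avro.codec (sketch)//
The avro.codec values shown above can also be read without avro-tools. The
following is a minimal Python sketch that parses only the header of an Avro
object container file (magic bytes followed by a string-to-bytes metadata
map, as laid out in the Avro specification). The function names are my own
and hypothetical, and the sample header at the end is synthetic, built in
memory rather than taken from the files in this thread.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (used for both keys and values)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # an empty block terminates the map
            break
        if count < 0:   # negative count is followed by the block's byte size
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_codec(stream):
    """Return the codec name; an absent avro.codec entry means 'null'."""
    return read_avro_metadata(stream).get("avro.codec", b"null").decode("ascii")


# Synthetic header: one metadata entry (avro.codec=snappy) plus a sync marker.
sample = io.BytesIO(MAGIC + b"\x02\x14avro.codec\x0csnappy\x00" + bytes(16))
print(avro_codec(sample))  # prints: snappy
```

To use it on a real file, copy the file out of HDFS first (for example with
hdfs dfs -get), open the local copy in binary mode, and pass the file object
to avro_codec.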