Hi Abhishek,

As discussed, I tried the following parameters to achieve compression:
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type=BLOCK

I was able to achieve compression, but with the Deflate codec even though I specified Snappy. I suspect the codec and type parameters are not being propagated to the Hadoop libraries, so the job falls back to Deflate, which is the default codec in my setup. Is there a workaround for this behavior?

Thanks
Sushant

From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:38 PM
To: Sushant Pandey <[email protected]>
Cc: [email protected]
Subject: Re: Compaction job output not compressed

This looks a bit weird. All you generally need is:

writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder

and, optionally:

writer.codec.type=SNAPPY

which defaults to deflate if not specified. I double-checked our compaction pipeline output through avro-tools and it looks good as well, so I am not sure why you are not seeing the compression. Can you share the job file? I will try to run a clean deployment of compaction to see if I can reproduce this. What Gobblin version are you using?

Regards,
Abhishek

On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey <[email protected]> wrote:

Thanks Abhishek for the response. I have a follow-up query. In the input file's Avro metadata, the avro.codec property is set to snappy, whereas in the output file it is set to null; please see the attachment (for the file-size comparison and the avro.codec values). This behavior persists even if the parameters compaction.input.deduplicated and compaction.output.deduplicated are set to false. Is this expected behavior?
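A quick way to verify which codec an output file actually carries, if avro-tools is not at hand, is to read the avro.codec entry from the Avro object container file header. Below is a minimal, stdlib-only Python sketch of that check; it is an illustration of the Avro container format, not part of Gobblin or Hadoop, and the function names are our own.

```python
def _read_long(f):
    """Decode one zig-zag varint (an Avro 'long') from a byte stream."""
    shift, accum = 0, 0
    while True:
        byte = f.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def avro_codec(path):
    """Return the avro.codec value recorded in an Avro container file header."""
    with open(path, "rb") as f:
        if f.read(4) != b"Obj\x01":
            raise ValueError("not an Avro object container file")
        meta = {}
        while True:
            count = _read_long(f)   # number of key/value pairs in this map block
            if count == 0:
                break               # an empty block terminates the metadata map
            if count < 0:           # negative count: a byte size precedes the items
                count = -count
                _read_long(f)
            for _ in range(count):
                key = f.read(_read_long(f)).decode("utf-8")
                meta[key] = f.read(_read_long(f))
        # Per the Avro spec, a missing avro.codec implies "null" (no compression).
        return meta.get("avro.codec", b"null").decode("utf-8")
```

For example, `avro_codec("part-m-00000.avro")` (a hypothetical output file name) should return 'snappy', 'deflate', or 'null'; the avro-tools getmeta command reports the same information.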
Thanks
Sushant

From: Abhishek Tiwari <[email protected]>
Sent: Thursday, January 25, 2018 4:14 AM
To: [email protected]
Subject: Re: Compaction job output not compressed

Hi Sushant,

Just to set expectations: the compaction job removes duplicate records. It does this by comparing the values of the 'key' fields; if no key fields are specified, it compares whole records. Having said that, do you have duplicate records that are not getting removed? If so, what do the logs say?

Regards,
Abhishek

On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <[email protected]> wrote:
> Hi,
>
> I am executing a compaction job on Snappy-compressed Avro files. Though
> the job executes successfully, the output is not compressed. Following
> is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
> I tried these options with no success:
>
> mapreduce.output.fileoutputformat.compress=true
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
> mapreduce.output.fileoutputformat.compress.type=RECORD
>
> writer.output.format=AVRO
> writer.codec.type=SNAPPY
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> Kindly let me know how to proceed. Am I missing some configuration
> parameters?
>
> Thanks
> Sushant Pandey
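To summarize the configuration advice in this thread: in the Gobblin version under discussion, the writer-side properties below are what Abhishek says control Avro output compression (the mapreduce.* properties act at the Hadoop layer and, per the thread, were not being picked up). This is a sketch of the relevant fragment, not a complete job file, and property names may differ in newer Gobblin releases.

```
# Gobblin writer settings from the thread; per Abhishek,
# writer.codec.type defaults to deflate when unset.
writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.codec.type=SNAPPY
```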
