Hi Sushant,

Nice to know that you were able to get it to work. I agree that the type params are probably not being propagated, so it would need a small fix. Please feel free to submit a PR if you know what needs to be changed, or else please create a Jira for it.
Regards,
Abhishek

On Thu, Feb 1, 2018 at 1:24 AM, Sushant Pandey <[email protected]> wrote:

> Hi Abhishek,
>
> As discussed, I tried the following parameters to achieve compression:
>
> mapred.output.compress=true
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> mapred.output.compression.type=BLOCK
>
> I was able to achieve compression, but with the Deflate codec even though
> I specified Snappy. I suspect the codec and type parameters are not
> propagated to the Hadoop libs and thus it defaults to Deflate, which is
> set as the default codec in my setup.
>
> Is there a workaround available to beat this behavior?
>
> Thanks
> Sushant
>
> *From:* Abhishek Tiwari <[email protected]>
> *Sent:* Thursday, January 25, 2018 4:38 PM
> *To:* Sushant Pandey <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Compaction job output not compressed
>
> This looks a bit weird; all you generally need is:
>
> writer.output.format=AVRO
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> and optionally,
>
> writer.codec.type=SNAPPY
>
> .. which, if not specified, defaults to deflate.
>
> I double-checked our compaction pipeline output through avro-tools and it
> looks good as well. Not sure why you are not seeing the compression. Can
> you share the job file as well, and I will try to run a clean deployment
> of compaction to see if I can reproduce this.
>
> What is the Gobblin version that you are using?
>
> Regards,
> Abhishek
>
> On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey <[email protected]> wrote:
>
> Thanks Abhishek for the response.
>
> I have a follow-up query. The Avro meta information in the input file has
> the *avro.codec* property's value set to *snappy*, whereas the output
> file has this set to *null*; please see the attachment (for file size
> comparison and avro.codec value).
> This behavior persists even if the parameters
> *compaction.input.deduplicated* and *compaction.output.deduplicated* are
> set to *false*.
>
> Is this an expected behavior?
>
> Thanks
> Sushant
>
> *From:* Abhishek Tiwari <[email protected]>
> *Sent:* Thursday, January 25, 2018 4:14 AM
> *To:* [email protected]
> *Subject:* Re: Compaction job output not compressed
>
> Hi Sushant,
>
> Just to set the expectations: the compaction job removes duplicate
> records. The way it does this is by comparing the 'key' field's values.
> If the key fields are not specified, it compares the whole record.
> Having said that, do you have duplicate records that are not getting
> removed? If so, what do the logs say?
>
> Regards,
> Abhishek
>
> On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <[email protected]> wrote:
>
> Hi,
>
> I am executing a compaction job on snappy-compressed Avro files. Though
> the job is executing successfully, the output is not compressed.
> Following is my configuration for the compaction job:
>
> fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> writer.fs.uri=${fs.uri}
>
> job.name=CompactKafkaMR
> job.group=PNDA
>
> mr.job.max.mappers=5
>
> compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> compaction.input.subdir=.
> compaction.dest.subdir=.
> compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> compaction.timebased.max.time.ago=10d
> compaction.timebased.min.time.ago=1h
> compaction.input.deduplicated=true
> compaction.output.deduplicated=true
> compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> compaction.timezone=UTC
> compaction.job.overwrite.output.dir=true
> compaction.recompact.from.input.for.late.data=true
>
> I tried these options with no success:
>
> mapreduce.output.fileoutputformat.compress=true
> mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
> mapreduce.output.fileoutputformat.compress.type=RECORD
> writer.output.format=AVRO
> writer.codec.type=SNAPPY
> writer.builder.class=gobblin.writer.AvroDataWriterBuilder
>
> Kindly let me know how to proceed on this. Am I missing some
> configuration parameters?
>
> Thanks
> Sushant Pandey
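For anyone following this thread: a quick way to confirm which codec actually landed in the compacted output, without avro-tools on the path, is to read the avro.codec entry straight out of the Avro object container file header (magic "Obj\x01", then a map<string, bytes> of metadata). This is a minimal stdlib-Python sketch, not part of the thread; the file path in the usage comment is a placeholder.

```python
AVRO_MAGIC = b"Obj\x01"

def _read_long(buf: bytes, pos: int):
    """Decode one Avro zigzag-varint long; return (value, new_pos)."""
    shift, acc = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos

def read_avro_codec(header: bytes) -> str:
    """Return the avro.codec value from an Avro container-file header.

    Per the Avro spec, the header is: magic 'Obj\\x01', a map<string, bytes>
    of file metadata, then a 16-byte sync marker. A missing avro.codec
    entry means the file is uncompressed ('null').
    """
    if header[:4] != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    pos, meta = 4, {}
    count, pos = _read_long(header, pos)
    while count != 0:
        if count < 0:  # negative count: a block byte-size follows
            _, pos = _read_long(header, pos)
            count = -count
        for _ in range(count):
            klen, pos = _read_long(header, pos)
            key = header[pos:pos + klen].decode("utf-8")
            pos += klen
            vlen, pos = _read_long(header, pos)
            meta[key] = header[pos:pos + vlen]
            pos += vlen
        count, pos = _read_long(header, pos)
    return meta.get("avro.codec", b"null").decode("utf-8")

# Usage (path is a placeholder): the header sits in the first few KB.
# with open("/tmp/part-m-00000.avro", "rb") as f:
#     print(read_avro_codec(f.read(65536)))
```

Running this against both the input and the compacted output should show "snappy" versus "null", matching what Sushant saw in the attachment.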
