This looks a bit weird; all you generally need is:

writer.output.format=AVRO
writer.builder.class=gobblin.writer.AvroDataWriterBuilder

and optionally,
writer.codec.type=SNAPPY

...which, if not specified, defaults to deflate.

I double-checked our compaction pipeline output through avro-tools and it
looks good as well, so I am not sure why you are not seeing the compression.
Can you share the job file too? I will try to run a clean deployment of
compaction and see if I can reproduce this.
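In case avro-tools isn't handy, one quick way to confirm which codec a file
was actually written with is to read the `avro.codec` entry straight out of
the container-file header. Here is a minimal sketch; the parsing below is
hand-rolled from the Avro object-container-file spec (magic bytes, zigzag
varint longs, metadata map), not a Gobblin or avro-tools API, and the
function/file names are illustrative:

```python
def _read_long(buf, pos):
    # Avro longs are zigzag-encoded little-endian varints.
    b = buf[pos]
    n = b & 0x7F
    shift = 7
    pos += 1
    while b & 0x80:
        b = buf[pos]
        n |= (b & 0x7F) << shift
        shift += 7
        pos += 1
    return (n >> 1) ^ -(n & 1), pos


def read_avro_codec(data):
    """Return the avro.codec value from an Avro container file's bytes,
    or 'null' if the writer recorded no codec."""
    assert data[:4] == b"Obj\x01", "not an Avro object container file"
    pos = 4
    meta = {}
    while True:
        count, pos = _read_long(data, pos)
        if count == 0:          # zero count terminates the metadata map
            break
        if count < 0:           # negative count: a byte-size prefix follows
            count = -count
            _, pos = _read_long(data, pos)
        for _ in range(count):
            klen, pos = _read_long(data, pos)
            key = data[pos:pos + klen].decode()
            pos += klen
            vlen, pos = _read_long(data, pos)
            meta[key] = data[pos:pos + vlen]
            pos += vlen
    return meta.get("avro.codec", b"null").decode()


# Illustrative usage (path is hypothetical):
# print(read_avro_codec(open("part-r-00000.avro", "rb").read()))
```

A compressed output file should report `snappy` (or `deflate`); an
uncompressed one reports `null`, which is what you are describing.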

What is the Gobblin version that you are using?

Regards,
Abhishek

On Wed, Jan 24, 2018 at 11:56 PM, Sushant Pandey <
[email protected]> wrote:

> Thanks Abhishek for the response.
>
> I have a follow-up query. The Avro meta information in the input file has
> the *avro.codec* property's value set to *snappy*, whereas the output file
> has it set to *null*; please see the attachment (for the file size
> comparison and the avro.codec value). This behavior persists even if the
> parameters *compaction.input.deduplicated* and
> *compaction.output.deduplicated* are set to *false*.
>
> Is this expected behavior?
>
> Thanks
> sushant
>
> *From: *Abhishek Tiwari <[email protected]>
> *Sent: *Thursday, January 25, 2018 4:14 AM
> *To: *[email protected]
> *Subject: *Re: Compaction job output not compressed
>
> Hi Sushant,
>
> Just to set expectations: the compaction job removes duplicate records. It
> does this by comparing the values of the 'key' fields; if no key fields
> are specified, it compares the whole record. Having said that, do you have
> duplicate records that are not getting removed? If so, what do the logs
> say?
>
> Regards,
> Abhishek
>
> On Mon, Jan 22, 2018 at 3:43 AM, Sushant Pandey <
> [email protected]> wrote:
>
> > Hi,
> >
> > I am executing a compaction job on Snappy-compressed Avro files. Though
> > the job executes successfully, the output is not compressed. The
> > following is my configuration for the compaction job:
> >
> > fs.uri=hdfs://hdp-ubuntu-hadoop-mgr-1:8020
> > writer.fs.uri=${fs.uri}
> >
> > job.name=CompactKafkaMR
> > job.group=PNDA
> >
> > mr.job.max.mappers=5
> >
> > compaction.datasets.finder=gobblin.compaction.dataset.TimeBasedSubDirDatasetsFinder
> > compaction.input.dir=/user/pnda/PNDA_datasets/datasets
> > compaction.dest.dir=/user/pnda/PNDA_datasets/compacted8
> > compaction.input.subdir=.
> > compaction.dest.subdir=.
> > compaction.timebased.folder.pattern='year='YYYY/'month='MM/'day='dd/'hour='HH
> > compaction.timebased.max.time.ago=10d
> > compaction.timebased.min.time.ago=1h
> > compaction.input.deduplicated=true
> > compaction.output.deduplicated=true
> > compaction.jobprops.creator.class=gobblin.compaction.mapreduce.MRCompactorTimeBasedJobPropCreator
> > compaction.job.runner.class=gobblin.compaction.mapreduce.avro.MRCompactorAvroKeyDedupJobRunner
> > compaction.timezone=UTC
> > compaction.job.overwrite.output.dir=true
> > compaction.recompact.from.input.for.late.data=true
> >
> > I tried these options with no success:
> >
> > mapreduce.output.fileoutputformat.compress=true
> > mapreduce.output.fileoutputformat.compress.codec=hadoop.io.compress.SnappyCodec
> > mapreduce.output.fileoutputformat.compress.type=RECORD
> >
> > writer.output.format=AVRO
> > writer.codec.type=SNAPPY
> > writer.builder.class=gobblin.writer.AvroDataWriterBuilder
> >
> > Kindly let me know how to proceed. Am I missing some configuration
> > parameters?
> >
> > Thanks
> > Sushant Pandey
