Hi Rahul,

Can you paste the logs related to HoodieCleaner? That could give us clues.

Thanks
Vinoth

On Fri, Mar 8, 2019 at 3:25 AM [email protected] wrote:

Dear All,

I am using DeltaStreamer to stream data from a Kafka topic and write it into a Hudi dataset. For this use case I am not doing any upserts; all writes are inserts only, so each job creates a new parquet file after the ingest job and a large number of small files are being created. How can I merge these files from the DeltaStreamer job using the available configurations?

I think compactionSmallFileSize may be useful for this case, but I am not sure whether it applies to DeltaStreamer or not. I tried it in DeltaStreamer but it did not work. Please assist on this and, if possible, give one example.

Thanks & Regards
Rahul

On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

You can try adding hoodie.parquet.small.file.limit=104857600 to your property file to specify 100MB files. Note that this works only if you are using the insert (not bulk_insert) operation; Hudi will enforce file sizing at ingest time. As of now there is no support for collapsing these file groups (parquet + related log files) into a larger file group (a HIP/design may come soon). Does that help?

Also, on compaction in general: since you don't have any updates, I think you can simply use the copy_on_write storage. Inserts will go to the parquet file anyway on MOR (but if you would like to be able to deal with updates later, I understand where you are going).

Thanks
Vinoth
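
To make the property concrete, here is a minimal sketch of the file passed to DeltaStreamer via --props; the /hudi/kafka-source.properties path is taken from the command shared later in the thread, and the Kafka/schema settings such a file would also need are omitted:

# /hudi/kafka-source.properties (illustrative excerpt only)
# files under 100MB (100 * 1024 * 1024 = 104857600 bytes) are treated as small
# and will be expanded by subsequent inserts; effective only with --op insert
hoodie.parquet.small.file.limit=104857600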

On Mon, Mar 11, 2019 at 12:32 AM [email protected] wrote:

Dear Vinoth,

For one of my use cases I am doing only inserts. For testing I am inserting data with only 5-10 records, and I am continuously pushing data to the Hudi dataset. Since it is insert-only, every insert creates new small files in the dataset.

If my insertion interval is short and I plan to keep the data for years, this flow will create lots of small files. I just want to know whether Hudi can merge these small files in any way.

Thanks & Regards
Rahul P

On 2019/03/11 18:25:46, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

Hudi/copy-on-write storage will keep expanding your existing parquet files to reach the configured file size, once you set the small file size config. For example, at Uber we write 1GB files this way. To do that, you could set something like this:

http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024

Please let me know if you have trouble achieving this. Also, please use the insert operation (not bulk_insert) for this to work.

Thanks
Vinoth
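
For concreteness, those two expressions evaluate to the byte values that appear later in the thread; written as property lines (using the key names Rahul uses below) they would look like:

# limitFileSize: 1 * 1024 * 1024 * 1024 = 1073741824 bytes (~1GB target file size)
hoodie.parquet.max.file.size=1073741824
# compactionSmallFileSize: 900 * 1024 * 1024 = 943718400 bytes (~900MB small-file threshold)
hoodie.parquet.small.file.limit=943718400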

On Tue, Mar 12, 2019 at 12:23 PM [email protected] wrote:

Dear Vinoth,

I tried the configurations below:

hoodie.parquet.max.file.size=1073741824
hoodie.parquet.small.file.limit=943718400

I am using the command below for inserting data from the JSON Kafka source:

spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities-0.4.5.jar \
  --storage-type COPY_ON_WRITE \
  --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
  --source-ordering-field stype \
  --target-base-path /MERGE \
  --target-table MERGE \
  --props /hudi/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --op insert

But each insert job creates a new parquet file; it is not touching the old parquet files. For reference, I am sharing some of the parquet files of the Hudi dataset which are generated as part of DeltaStreamer data insertion:

93      /MERGE/2019/03/06/.hoodie_partition_metadata
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet

Each job creates files of 424K and none of them are being merged. Can you please confirm whether Hudi can achieve the use case I mentioned? If this merging/compacting feature is there, kindly tell me what I am missing here.

Thanks & Regards
Rahul

On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

The files you shared all belong to the same file group (they share the same prefix, if you notice; see https://hudi.apache.org/concepts.html#terminologies). Given that it is not creating new file groups every run, the feature is kicking in.

During each insert, Hudi will find the latest file in each file group (i.e. the one with the largest instant time/timestamp) and rewrite/expand that with the new inserts. Hudi does not clean up the old files immediately, since that can cause running queries to fail; they could have started hours ago (e.g. Hive).

If you want to reduce the number of files you see, you can lower the number of commits retained: https://hudi.apache.org/configurations.html#retainCommits. We retain 24 by default, i.e. after the 25th file, the first one will be automatically cleaned.

Does that make sense? Are you able to query this data and find the expected records?

Thanks
Vinoth
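
As a sketch of the retention knob described above (the value shown is just the default mentioned in the reply; Rahul lowers it to 6 further down in the thread):

# number of commits whose file versions the cleaner retains;
# with the default of 24, the oldest file in a file group is cleaned
# automatically once the 25th version is written
hoodie.cleaner.commits.retained=24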

On 2019/03/12 23:04:43, Vinoth Chandar <[email protected]> wrote:

Opened up https://github.com/uber/hudi/pull/599/files to improve this out of the box.

On Wed, Mar 13, 2019 at 12:51 AM [email protected] wrote:

Dear Vinoth,

I too verified that the feature is kicking in. I am using the properties below and my insert job runs at a 10s interval:

hoodie.cleaner.commits.retained=6
hoodie.keep.max.commits=6
hoodie.keep.min.commits=3
hoodie.parquet.small.file.limit=943718400
hoodie.parquet.max.file.size=1073741824
hoodie.compact.inline=false

Now I can see about 180 files in the Hudi dataset with hoodie.compact.inline=false:

hadoop fs -ls /MERGE/2019/03/14/* | wc -l
181

If I set hoodie.compact.inline=true, I get the error below:

Loaded instants [[20190313131254__clean__COMPLETED], [20190313131254__commit__COMPLETED], [20190313131316__clean__COMPLETED], [20190313131316__commit__COMPLETED], [20190313131339__clean__COMPLETED], [20190313131339__commit__COMPLETED], [20190313131401__clean__COMPLETED], [20190313131401__commit__COMPLETED], [20190313131423__clean__COMPLETED], [20190313131423__commit__COMPLETED], [20190313131445__clean__COMPLETED], [20190313131445__commit__COMPLETED], [20190313131512__commit__COMPLETED]]
Exception in thread "main" com.uber.hoodie.exception.HoodieNotSupportedException: Compaction is not supported from a CopyOnWrite table
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.scheduleCompaction(HoodieCopyOnWriteTable.java:168)

Please assist on this.

Thanks & Regards
Rahul

On 2019/03/13 08:42:13, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

Good to know. Yes, for copy_on_write please turn off inline compaction (which probably explains why the default is false).

Thanks
Vinoth
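
Put in the same property-file form, the takeaway is a single line; the comment only restates the exception shown above:

# COPY_ON_WRITE tables do not support compaction (see the
# HoodieNotSupportedException above), so leave inline compaction off
hoodie.compact.inline=false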

On 2019/03/13 12:57:59, [email protected] wrote:

Dear Vinoth,

As I already mentioned in my previous mail, I am seeing more than 180 parquet files:

hadoop fs -ls /MERGE/2019/03/14/* | wc -l
181

I set the commits to retain to 6 (hoodie.cleaner.commits.retained=6), so why are 181 files showing up? This is the point where I am facing a problem.

Thanks & Regards
Rahul

On Wed, Apr 3, 2019 at 6:27 AM [email protected] wrote:

Dear Vinoth,

I am still facing the same issue on the COW table. I think the clean job is invoked while Spark loads the Hudi dataset, but the old commits' parquet files are still there; they are not being cleaned. Can you please assist on this?

Thanks & Regards
Rahul

On 2019/04/04 00:41:15, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

Sorry, not following fully. Are you saying cleaning is not triggered at all, or that the cleaner is not reclaiming older files? This definitely should be working, so it is most likely a config issue.

Thanks
Vinoth
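
One way to separate those two cases (no clean action at all vs. cleaner running but not reclaiming files) might be to list the clean instants on the timeline; this assumes the timeline lives under the dataset base path's .hoodie folder, with the /MERGE base path used earlier in the thread:

hadoop fs -ls /MERGE/.hoodie | grep -i clean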
As it is a basic feature i am not able to go > further. > > Thanks & Regards > Rahul >
