Another quick check: are all 180 files part of the same file group, i.e. do they all begin with the same UUID prefix in their names?
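One quick way to check that from the shell, as a sketch assuming the /MERGE/2019/03/14 partition path used elsewhere in this thread, is to group the parquet file names by their UUID prefix:

# A single output line would mean all of the files belong to one file group.
hadoop fs -ls /MERGE/2019/03/14/ | grep '\.parquet$' | awk -F/ '{print $NF}' | cut -d_ -f1 | sort | uniq -c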
On Wed, Mar 13, 2019 at 7:14 AM Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

From the timeline, it does seem like cleaning happens regularly. Can you share the logs from the driver in a gist?

Thanks
Vinoth

On Wed, Mar 13, 2019 at 5:58 AM [email protected] <[email protected]> wrote:

On 2019/03/13 08:42:13, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

Good to know. Yes, for copy_on_write please turn off inline compaction. (That probably explains why the default was false.)

Thanks
Vinoth

On Wed, Mar 13, 2019 at 12:51 AM [email protected] <[email protected]> wrote:

On 2019/03/12 23:04:43, Vinoth Chandar <[email protected]> wrote:

Opened up https://github.com/uber/hudi/pull/599/files to improve this out-of-box.

On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

The files you shared all belong to the same file group (they share the same prefix, if you notice; see https://hudi.apache.org/concepts.html#terminologies). Given that no new file groups are being created on every run, the feature is kicking in.

During each insert, Hudi will find the latest file in each file group (i.e. the one with the largest instant time, or timestamp) and rewrite/expand it with the new inserts. Hudi does not clean up the old files immediately, since that can cause running queries to fail; they could have started even hours ago (e.g. Hive).

If you want to reduce the number of files you see, you can lower the number of commits retained: https://hudi.apache.org/configurations.html#retainCommits. We retain 24 by default, i.e. after the 25th file, the first one will be automatically cleaned.

Does that make sense? Are you able to query this data and find the expected records?

Thanks
Vinoth

On Tue, Mar 12, 2019 at 12:23 PM [email protected] <[email protected]> wrote:

On 2019/03/11 18:25:46, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

Hudi copy-on-write storage will keep expanding your existing parquet files to reach the configured file size, once you set the small-file-size config. For example, at Uber we write 1 GB files this way. To do that, you could set something like:

http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024

Please let me know if you have trouble achieving this. Also, please use the insert operation (not bulk_insert) for this to work.

Thanks
Vinoth
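In property-file form, using the property keys that appear elsewhere in this thread for these two settings, that advice would look something like the sketch below (the props path is the one from the spark-submit command quoted further down; adjust to your own setup):

# Sketch: append the 1 GB target file size / 900 MB small-file threshold
# described above to the DeltaStreamer property file.
cat >> /hudi/kafka-source.properties <<'EOF'
# limitFileSize: 1 * 1024 * 1024 * 1024 bytes
hoodie.parquet.max.file.size=1073741824
# compactionSmallFileSize: 900 * 1024 * 1024 bytes
hoodie.parquet.small.file.limit=943718400
EOF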
On Mon, Mar 11, 2019 at 12:32 AM [email protected] <[email protected]> wrote:

On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:

Hi Rahul,

You can try adding hoodie.parquet.small.file.limit=104857600 to your property file to specify 100 MB files. Note that this works only if you are using the insert (not bulk_insert) operation; Hudi will enforce file sizing at ingest time. As of now, there is no support for collapsing these file groups (parquet + related log files) into a larger file group (a HIP/design may come soon). Does that help?

Also, on compaction in general: since you don't have any updates, I think you can simply use the copy_on_write storage? Inserts will go to the parquet file anyway on MOR (but if you'd like to be able to deal with updates later, I understand where you are going).

Thanks
Vinoth
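As a single property-file line, that 100 MB suggestion is just the following (same props file as in the sketch above; 104857600 = 100 * 1024 * 1024):

# Sketch: 100 MB small-file limit; only enforced when DeltaStreamer runs with --op insert.
echo "hoodie.parquet.small.file.limit=104857600" >> /hudi/kafka-source.properties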
On Fri, Mar 8, 2019 at 3:25 AM [email protected] <[email protected]> wrote:

Dear All,

I am using DeltaStreamer to stream data from a Kafka topic and write it into a Hudi dataset. For this use case I am not doing any upserts; all writes are inserts, so each run creates a new parquet file after the ingest job, and a large number of small files is being created. How can I merge these files from the DeltaStreamer job using the available configurations?

I think compactionSmallFileSize may be useful for this case, but I am not sure whether it applies to DeltaStreamer or not. I tried it in DeltaStreamer but it didn't work. Please assist on this, and if possible give one example.

Thanks & Regards
Rahul

Dear Vinoth,

For one of my use cases, I am doing only inserts. For testing I am inserting batches of just 5-10 records, and I am continuously pushing data to the Hudi dataset. As it is insert-only, every insert creates new small files in the dataset. If my insertion interval is short and I plan to keep the data for years, this flow will create lots of small files. I just want to know whether Hudi can merge these small files in any way.

Thanks & Regards
Rahul P

Dear Vinoth,

I tried the below configurations:

hoodie.parquet.max.file.size=1073741824
hoodie.parquet.small.file.limit=943718400

I am using the below command for inserting data from the JSON Kafka source:

spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field stype --target-base-path /MERGE --target-table MERGE --props /hudi/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op insert

But each insert job creates a new parquet file; it does not touch the old parquet files. For reference, here are some of the parquet files of the Hudi dataset generated by the DeltaStreamer inserts:

93      /MERGE/2019/03/06/.hoodie_partition_metadata
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet

Each job creates files of 424 K and none of them are merged. Can you please confirm whether Hudi can achieve the use case I mentioned? If this merging/compacting feature is there, kindly tell me what I am missing here.

Thanks & Regards
Rahul
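The listing above is a single file group: every file shares the 1e9735d2-... prefix, and the trailing number in each name is the commit instant time, so the current file slice (the one Hudi will rewrite/expand on the next insert) is the file with the largest suffix. A sketch of picking it out, assuming the usual hadoop fs -ls output with the path in the last column:

# The prefix is identical and the instant times are fixed-width, so a plain
# lexicographic sort puts the latest slice last.
hadoop fs -ls /MERGE/2019/03/06/ | grep '1e9735d2-2057-40c6-a4df-078eb297a298' | awk '{print $NF}' | sort | tail -n 1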
Dear Vinoth,

I too verified that the feature is kicking in. I am using the below properties, and my insert job runs at a 10-second interval:

hoodie.cleaner.commits.retained=6
hoodie.keep.max.commits=6
hoodie.keep.min.commits=3
hoodie.parquet.small.file.limit=943718400
hoodie.parquet.max.file.size=1073741824
hoodie.compact.inline=false

Now I can see about 180 files in the Hudi dataset with hoodie.compact.inline=false:

hadoop fs -ls /MERGE/2019/03/14/* | wc -l
181

If I set hoodie.compact.inline=true, I am getting the below error:

Loaded instants [[20190313131254__clean__COMPLETED], [20190313131254__commit__COMPLETED], [20190313131316__clean__COMPLETED], [20190313131316__commit__COMPLETED], [20190313131339__clean__COMPLETED], [20190313131339__commit__COMPLETED], [20190313131401__clean__COMPLETED], [20190313131401__commit__COMPLETED], [20190313131423__clean__COMPLETED], [20190313131423__commit__COMPLETED], [20190313131445__clean__COMPLETED], [20190313131445__commit__COMPLETED], [20190313131512__commit__COMPLETED]]
Exception in thread "main" com.uber.hoodie.exception.HoodieNotSupportedException: Compaction is not supported from a CopyOnWrite table
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.scheduleCompaction(HoodieCopyOnWriteTable.java:168)

Please assist on this.

Thanks & Regards
Rahul

Dear Vinoth,

As mentioned in my previous mail, I am seeing more than 180 parquet files:

hadoop fs -ls /MERGE/2019/03/14/* | wc -l
181

I set the commits to retain to 6 (hoodie.cleaner.commits.retained=6) only, so why are 181 files showing up? I am facing a problem at this point.

Thanks & Regards
Rahul
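A rough way to square that count with the retention setting, assuming roughly one new parquet file per file group per commit and the "retain N, clean after the (N+1)th file" behaviour described earlier in the thread, is to count the file groups in the partition:

# Sketch: estimate how many parquet files the cleaner should leave behind at steady state.
RETAINED=6
FILE_GROUPS=$(hadoop fs -ls /MERGE/2019/03/14/ | grep '\.parquet$' | awk -F/ '{print $NF}' | cut -d_ -f1 | sort -u | wc -l)
echo "file groups: $FILE_GROUPS, expected parquet files: about $((FILE_GROUPS * (RETAINED + 1)))"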
