Hi Rahul,

The files you shared all belong to the same file group (notice that they all share the same prefix; see https://hudi.apache.org/concepts.html#terminologies). Since new file groups are not being created on every run, the small-file handling feature is indeed kicking in.
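To make that concrete, here is one of the file names from your listing, annotated (this is just a reading of the name itself, not new output):

  1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet

  file id:      1e9735d2-2057-40c6-a4df-078eb297a298   (identical across all the files you listed, which is what makes them one file group)
  commit time:  20190312002655                         (the instant time; a new one for every commit)

The 0 in the middle is just a write/task token, if I remember the naming correctly.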
During each insert, Hudi finds the latest file in each file group (i.e., the one with the largest instant time/timestamp) and rewrites/expands it with the new inserts. Hudi does not clean up the old files immediately, since that could cause running queries to fail; they could have started even hours ago (e.g., long-running Hive queries). If you want to reduce the number of files you see, you can lower the number of commits retained (https://hudi.apache.org/configurations.html#retainCommits). We retain 24 by default, i.e., once the 25th file is written, the first one will be automatically cleaned.
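For example, in the properties file you pass via --props, something like the below should do it (I am writing the key name from memory, so please double-check it against the retainCommits entry on the configurations page; the value is just an example):

# retain fewer commits than the default of 24 (example value)
# key name from memory - verify against the configurations page
hoodie.cleaner.commits.retained=10

Each cleaned commit removes the older file versions it superseded, so the listing you shared would stay bounded instead of growing by one file per run.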
Does that make sense? Are you able to query this data and find the expected records?

Thanks
Vinoth

On Tue, Mar 12, 2019 at 12:23 PM [email protected] <[email protected]> wrote:

> On 2019/03/11 18:25:46, Vinoth Chandar <[email protected]> wrote:
> > Hi Rahul,
> >
> > Hudi/Copy-on-write storage would keep expanding your existing parquet files to reach the configured file size, once you set the small file size config.
> >
> > For e.g.: we at Uber write 1GB files this way. To do that, you could set something like this:
> > http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
> > http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024
> >
> > Please let me know if you have trouble achieving this. Also, please use the insert operation (not bulk_insert) for this to work.
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Mar 11, 2019 at 12:32 AM [email protected] <[email protected]> wrote:
> >
> > > On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
> > > > Hi Rahul,
> > > >
> > > > You can try adding hoodie.parquet.small.file.limit=104857600 to your property file to specify 100MB files. Note that this works only if you are using the insert (not bulk_insert) operation. Hudi will enforce file sizing at ingest time. As of now, there is no support for collapsing these file groups (parquet + related log files) into a larger file group (a HIP/design may come soon). Does that help?
> > > >
> > > > Also, on compaction in general: since you don't have any updates, I think you can simply use the copy_on_write storage? Inserts will go to the parquet file anyway on MOR (but if you would like to be able to deal with updates later, I understand where you are going).
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <[email protected]> wrote:
> > > >
> > > > > Dear All
> > > > >
> > > > > I am using DeltaStreamer to stream data from a Kafka topic and write it into a Hudi dataset.
> > > > > For this use case I am not doing any upserts; everything is insert only, so each job creates a new parquet file after the ingest job. As a result, a large number of small files are being created. How can I merge these files from the DeltaStreamer job using the available configurations?
> > > > >
> > > > > I think compactionSmallFileSize may be useful for this case, but I am not sure whether it applies to DeltaStreamer or not. I tried it in DeltaStreamer but it didn't work. Please assist on this. If possible, give one example for the same.
> > > > >
> > > > > Thanks & Regards
> > > > > Rahul
> > >
> > > Dear Vinoth
> > >
> > > For one of my use cases, I am doing only inserts. For testing I am inserting data which has 5-10 records only. I am continuously pushing data to the Hudi dataset. As it is insert only, every insert is creating new small files in the dataset.
> > >
> > > If my insertion interval is short and I am planning to keep the data for years, this flow will create lots of small files.
> > > I just want to know whether Hudi can merge these small files in any way.
> > >
> > > Thanks & Regards
> > > Rahul P
>
> Dear Vinoth
>
> I tried the below configurations.
>
> hoodie.parquet.max.file.size=1073741824
> hoodie.parquet.small.file.limit=943718400
>
> I am using the below command for inserting data from the JSON Kafka source.
>
> spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field stype --target-base-path /MERGE --target-table MERGE --props /hudi/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op insert
>
> But for each insert job it's creating a new parquet file. It's not touching the old parquet files.
>
> For reference I am sharing some of the parquet files of the Hudi dataset which are being generated as part of DeltaStreamer data insertion.
>
> 93       /MERGE/2019/03/06/.hoodie_partition_metadata
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet
>
> Each job is creating files of 424K and it's not merging any. Can you please confirm whether Hudi can achieve the use case which I mentioned. If this merging/compacting feature is there, kindly tell me what I am missing here.
>
> Thanks & Regards
> Rahul
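PS: Putting the pieces from this thread together, a starting point could look roughly like the below. It just combines the values and command from your own mail with the illustrative cleaner setting from above, so treat it as a sketch rather than a verified config.

Contents of /hudi/kafka-source.properties (in addition to your existing Kafka/schema settings):

hoodie.parquet.max.file.size=1073741824
hoodie.parquet.small.file.limit=943718400
# optional; key name from memory - verify before using
hoodie.cleaner.commits.retained=10

Launch command (same as yours, with --op insert, not bulk_insert):

spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  hoodie-utilities-0.4.5.jar \
  --storage-type COPY_ON_WRITE \
  --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
  --source-ordering-field stype \
  --target-base-path /MERGE \
  --target-table MERGE \
  --props /hudi/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --op insert

With these in place, each new insert should write a larger version of the existing file group rather than starting new file groups, and the cleaner will keep the number of old versions bounded.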
