On 2019/03/11 18:25:46, Vinoth Chandar <[email protected]> wrote:
> Hi Rahul,
>
> Hudi/Copy-on-write storage would keep expanding your existing parquet files
> to reach the configured file size, once you set the small file size
> config.
>
> For example, at Uber we write 1GB files this way. To do that, you could set
> something like this:
> http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
> http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024
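> In the properties file passed via --props, those two settings correspond to
> roughly the following lines (the keys are the ones listed on the
> configurations page):
>
> # limitFileSize -> 1 GB, compactionSmallFileSize -> 900 MB
> hoodie.parquet.max.file.size=1073741824
> hoodie.parquet.small.file.limit=943718400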
>
>
> Please let me know if you have trouble achieving this. Also, please use the
> insert operation (not bulk_insert) for this to work.
>
>
> Thanks
> Vinoth
>
> On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
> [email protected]> wrote:
>
> >
> >
> > On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
> > > Hi Rahul,
> > >
> > > You can try adding hoodie.parquet.small.file.limit=104857600 to your
> > > property file to specify 100MB files. Note that this works only if you
> > > are using the insert (not bulk_insert) operation. Hudi will enforce file
> > > sizing at ingest time. As of now, there is no support for collapsing
> > > these file groups (parquet + related log files) into a larger file group
> > > (a HIP/design may come soon). Does that help?
> > >
> > > Also, on compaction in general: since you don't have any updates,
> > > I think you can simply use the copy_on_write storage? Inserts will go to
> > > the parquet file anyway on MOR. (But if you'd like to be able to deal
> > > with updates later, I understand where you are going.)
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> > > [email protected]> wrote:
> > >
> > > > Dear All
> > > >
> > > > I am using DeltaStreamer to stream data from a Kafka topic and write
> > > > it into a Hudi dataset.
> > > > For this use case I am not doing any upserts; everything is insert
> > > > only, so each job creates a new parquet file after the ingest job, and
> > > > a large number of small files are being created. How can I merge these
> > > > files from the DeltaStreamer job using the available configurations?
> > > >
> > > > I think compactionSmallFileSize may be useful for this case, but I am
> > > > not sure whether it applies to DeltaStreamer or not. I tried it in
> > > > DeltaStreamer but it didn't work. Please assist on this. If possible,
> > > > give one example for the same.
> > > >
> > > > Thanks & Regards
> > > > Rahul
> > > >
> > >
> >
> >
> > Dear Vinoth
> >
> > For one of my use cases, I am doing only inserts. For testing, I am
> > inserting data with only 5-10 records at a time, and I am continuously
> > pushing data to the Hudi dataset. As it is insert only, every insert
> > creates new small files in the dataset.
> >
> > If my insertion interval is short and I plan to keep the data for
> > years, this flow will create lots of small files.
> > I just want to know whether Hudi can merge these small files in any way.
> >
> >
> > Thanks & Regards
> > Rahul P
> >
> >
>
Dear Vinoth
I tried the below configurations.
hoodie.parquet.max.file.size=1073741824
hoodie.parquet.small.file.limit=943718400
I am using the below command for inserting data from the JSON Kafka source.
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  hoodie-utilities-0.4.5.jar \
  --storage-type COPY_ON_WRITE \
  --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
  --source-ordering-field stype \
  --target-base-path /MERGE \
  --target-table MERGE \
  --props /hudi/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --op insert
But each insert job creates a new parquet file; it does not touch the old
parquet files.
For reference, I am sharing some of the parquet files of the Hudi dataset
that are being generated as part of the DeltaStreamer data insertion.
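(The sizes and paths below are a du-style listing of the partition path;
assuming the dataset sits on HDFS, something like the following command
produces this output.)

hdfs dfs -du -h /MERGE/2019/03/06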
93 /MERGE/2019/03/06/.hoodie_partition_metadata
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet
Each job creates files of 424 K, and none of them get merged. Can you please
confirm whether Hudi can achieve the use case I mentioned? If this
merging/compacting feature exists, kindly tell me what I am missing here.
Thanks & Regards
Rahul