Hi Rahul, you can try adding hoodie.parquet.small.file.limit=104857600 to your property file to specify 100MB files. Note that this works only if you are using the insert (not bulk_insert) operation. Hudi will enforce file sizing at ingest time. As of now, there is no support for collapsing existing file groups (parquet + related log files) into a larger file group (a HIP/design may come soon). Does that help?
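For reference, a minimal sketch of the relevant lines in the property file passed to DeltaStreamer (the max file size value below is just illustrative, and the exact config names are worth double checking against the Hudi version you are running):

    # files below this size are considered "small" and new inserts are routed into them
    hoodie.parquet.small.file.limit=104857600
    # upper bound on how large a single parquet file is allowed to grow (illustrative value)
    hoodie.parquet.max.file.size=125829120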
Also, on compaction in general: since you don't have any updates, I think you can simply use the copy_on_write storage type. Inserts will go to a parquet file anyway on MOR (but if you would like to be able to deal with updates later, I understand where you are going). I've also sketched a rough DeltaStreamer invocation at the very bottom of this mail, below your original message.

Thanks
Vinoth

On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
[email protected]> wrote:

> Dear All
>
> I am using DeltaStreamer to stream data from a Kafka topic and write it
> into a Hudi dataset. For this use case I am not doing any upserts; all
> operations are inserts only, so each job creates a new parquet file after
> the ingest run, and a large number of small files are being created. How
> can I merge these files from the DeltaStreamer job using the available
> configurations?
>
> I think compactionSmallFileSize may be useful for this case, but I am not
> sure whether it applies to DeltaStreamer or not. I tried it in
> DeltaStreamer but it didn't work. Please assist with this. If possible,
> give one example for the same.
>
> Thanks & Regards
> Rahul
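P.S. A very rough sketch of what the full DeltaStreamer invocation could look like with the knobs discussed above. The jar name, paths, table name and source class are placeholders for your setup, and flag names can differ slightly between Hudi versions, so please verify against --help for HoodieDeltaStreamer on your build:

    spark-submit \
      --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
      /path/to/hoodie-utilities-jar-with-dependencies.jar \
      --props file:///path/to/kafka-source.properties \
      --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
      --target-base-path /path/to/hudi/dataset \
      --target-table my_table \
      --storage-type COPY_ON_WRITE \
      --op INSERT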
