Hi Rahul,

Hudi's copy-on-write storage will keep expanding your existing parquet files toward the configured file size, once you set the small file size config.
For e.g., we at Uber write 1GB files this way. To do that, you could set something like this:

http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024

Please let me know if you have trouble achieving this. Also, please use the insert operation (not bulk_insert) for this to work.

Thanks
Vinoth

On Mon, Mar 11, 2019 at 12:32 AM [email protected] <[email protected]> wrote:
>
> On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
> > Hi Rahul,
> >
> > You can try adding hoodie.parquet.small.file.limit=104857600 to your
> > property file to specify 100MB files. Note that this works only if you
> > are using the insert (not bulk_insert) operation. Hudi will enforce file
> > sizing at ingest time. As of now, there is no support for collapsing these
> > file groups (parquet + related log files) into a larger file group (a
> > HIP/design may come soon). Does that help?
> >
> > Also, on compaction in general: since you don't have any updates,
> > I think you can simply use the copy_on_write storage? Inserts will go to
> > the parquet file anyway on MOR. (But if you'd like to be able to deal with
> > updates later, I understand where you are going.)
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <[email protected]> wrote:
> > >
> > > Dear All,
> > >
> > > I am using DeltaStreamer to stream data from a Kafka topic and write
> > > it into a Hudi dataset. For this use case I am not doing any upserts;
> > > all writes are inserts only, so each job creates a new parquet file
> > > after the ingest job, and a large number of small files are being
> > > created. How can I merge these files from the DeltaStreamer job using
> > > the available configurations?
> > >
> > > I think compactionSmallFileSize may be useful for this case, but I am
> > > not sure whether it applies to DeltaStreamer or not.
> > > I tried it in DeltaStreamer but it didn't work. Please assist on
> > > this. If possible, give one example for the same.
> > >
> > > Thanks & Regards
> > > Rahul
>
> Dear Vinoth,
>
> For one of my use cases, I am doing only inserts. For testing I am
> inserting data which has only 5-10 records. I am continuously pushing
> data to the Hudi dataset. As it is insert only, every insert creates new
> small files in the dataset.
>
> If my insertion interval is short and I am planning to keep the data for
> years, this flow will create lots of small files.
> I just want to know whether Hudi can merge these small files in any way.
>
> Thanks & Regards
> Rahul P
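[Editor's note] For reference, the two settings Vinoth describes above could be sketched in a DeltaStreamer properties file roughly like this. The property names correspond to the configuration-page anchors linked in the thread (limitFileSize, compactionSmallFileSize); the exact names and the file layout may vary with your Hudi version, so treat this as an illustration, not a verified config:

```properties
# Sketch of a Hudi properties file, assuming the 2019-era config names
# behind the linked anchors. Verify against your Hudi version's docs.

# Target size Hudi grows parquet files toward:
# 1 GB = 1 * 1024 * 1024 * 1024 bytes
hoodie.parquet.max.file.size=1073741824

# Files smaller than this are treated as "small" and expanded on insert:
# 900 MB = 900 * 1024 * 1024 bytes
hoodie.parquet.small.file.limit=943718400
```

As noted in the thread, this sizing only takes effect with the insert operation, not bulk_insert.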
