On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
> Hi Rahul,
>
> You can try adding hoodie.parquet.small.file.limit=104857600 to your
> property file to specify 100MB files. Note that this works only if you
> are using the insert (not bulk_insert) operation. Hudi will enforce
> file sizing at ingest time. As of now, there is no support for
> collapsing these file groups (parquet + related log files) into a
> larger file group (a HIP/design may come soon). Does that help?
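>
> For example, the relevant lines in the properties file passed to
> DeltaStreamer would look something like the below;
> hoodie.parquet.max.file.size is the companion config that caps how
> large any single file can grow, and the values are just illustrative:
>
>   # files smaller than this are candidates to be padded with new inserts
>   hoodie.parquet.small.file.limit=104857600
>   # upper bound on the size of any single parquet file (~120MB)
>   hoodie.parquet.max.file.size=125829120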
>
> Also, on compaction in general: since you don't have any updates,
> I think you can simply use the copy_on_write storage type? Inserts
> will go to the parquet file anyway on MOR. (But if you'd like to be
> able to deal with updates later, I understand where you are going.)
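>
> If you do switch, the invocation would look roughly like the sketch
> below. Treat it as a sketch: the exact class name and flags depend on
> the Hudi version you are running, and the paths, table and ordering
> field here are made up:
>
>   spark-submit \
>     --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
>     hoodie-utilities-bundle.jar \
>     --storage-type COPY_ON_WRITE \
>     --op INSERT \
>     --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
>     --source-ordering-field ts \
>     --target-base-path /path/to/hudi/dataset \
>     --target-table my_table \
>     --props /path/to/kafka-source.properties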
>
> Thanks
> Vinoth
>
> On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> [email protected]> wrote:
>
> > Dear All
> >
> > I am using DeltaStreamer to stream data from a Kafka topic and write
> > it into a Hudi dataset.
> > For this use case I am not doing any upserts; everything is insert
> > only, so each run creates a new parquet file after the ingest job,
> > and a large number of small files are being created. How can I merge
> > these files from the DeltaStreamer job using the available
> > configurations?
> >
> > I think compactionSmallFileSize may be useful for this case, but I am
> > not sure whether it applies to DeltaStreamer or not. I tried it with
> > DeltaStreamer but it did not work. Please assist on this, and if
> > possible give one example.
> >
> > Thanks & Regards
> > Rahul
> >
>
Dear Vinoth
For one of my use cases, I am doing only inserts. For testing, I am
inserting batches of only 5-10 records, and I am continuously pushing
data to the Hudi dataset. As it is insert only, every insert creates
new small files in the dataset.
If my insertion interval is short and I plan to keep the data for
years, this flow will create lots of small files.
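For reference, this is how I am checking the files after each run (the
dataset path below is just an example from my setup); every run adds
another small parquet file of a few KB:

  hdfs dfs -ls -h /data/hudi/my_dataset/2019/03/08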
I just want to know whether Hudi can merge these small files in any way.
Thanks & Regards
Rahul P