Hi Vinoth,

Trying this feature, I find that each write produces a new file that also
contains the previously inserted records.
But how do we clean up the old files when using COW tables?

Thanks,
Frank
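On the cleanup question: Hudi runs a cleaner after commits that removes old file versions, on COW tables as well. Below is a minimal sketch of the relevant write options, with key names assumed from the Hudi configurations page; verify the exact names and values against your release:

```python
# Hedged sketch: Hudi write options controlling cleanup of old file
# versions on a copy-on-write table. Key names are taken from the Hudi
# configuration docs and may differ across releases.
hudi_cleanup_options = {
    # use the insert operation so small-file handling kicks in
    "hoodie.datasource.write.operation": "insert",
    # retain the file versions needed by the last 10 commits;
    # older versions become candidates for cleaning
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
}
```

These would be passed as options to the Hudi Spark datasource write, alongside your table and key configs.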

Vinoth Chandar <vin...@apache.org> wrote on Thu, Feb 28, 2019 at 3:24 AM:

> Similarly, please try the 0.4.5 release. This has small file handling
> turned on by default..
>
> Also please use the insert api/operation, (not bulk_insert) if you want
> this behavior.
>
> Let us know if you still run into issues..
>
> On Tue, Feb 26, 2019 at 11:09 PM kaka chen <kaka11.c...@gmail.com> wrote:
>
> > Thanks!
> >
> > nishith agarwal <n3.nas...@gmail.com> wrote on Wed, Feb 27, 2019 at 2:56 PM:
> >
> > > Hi Kaka,
> > >
> > > Hudi automatically does file sizing for you. As you ingest more
> > > inserts, the existing files will be automatically sized. You can play
> > > with a few configs:
> > >
> > > https://hudi.apache.org/configurations.html#withStorageConfig -> this
> > > config lets you set a maximum size for your output files.
> > > https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> > > this config lets you set the minimum file size below which a file is
> > > considered small and will be automatically sized up.
> > >
> > > As you can guess, limitFileSize >= compactionSmallFileSize.
> > > Hope this helps.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Feb 26, 2019 at 6:52 PM kaka chen <kaka11.c...@gmail.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I found that insert generates at least one new file for each Spark
> > > > or Spark Streaming batch.
> > > > Is this the expected result? If so, how can we control these small
> > > > files? Does Hudi provide a tool to compact them?
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > >
> >
>
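To make the sizing relationship above concrete, here is a small standalone illustration (plain Python, not Hudi's actual code) of the small-file-handling idea: new inserts are first routed into existing files that sit below the small-file threshold, topping each up toward the maximum file size, and only the remainder spills into brand-new files. The function name and parameters are hypothetical:

```python
# Standalone illustration (NOT Hudi code) of small-file handling:
# route inserts into existing under-sized files first, then spill
# the remainder into new files capped at max_file_size.

def assign_inserts(existing_sizes, num_records, record_size,
                   small_file_limit, max_file_size):
    """Return (per-file top-up record counts, number of new files)."""
    assignments = []
    remaining = num_records
    for size in existing_sizes:
        if remaining <= 0:
            break
        if size < small_file_limit:          # file is "small": grow it
            capacity = (max_file_size - size) // record_size
            take = min(capacity, remaining)
            assignments.append(take)
            remaining -= take
        else:                                # already big enough: skip
            assignments.append(0)
    # leftover records spill into new files, each up to max_file_size
    per_new_file = max_file_size // record_size
    new_files = -(-remaining // per_new_file) if remaining > 0 else 0
    return assignments, new_files

# Example: the under-sized 40-byte file is filled toward max_file_size
# first; any leftover records go into a new file.
counts, new_files = assign_inserts(
    existing_sizes=[40, 100], num_records=25, record_size=4,
    small_file_limit=80, max_file_size=120)
```

This is also why limitFileSize must be >= compactionSmallFileSize: a file is only "small" below the latter, and is grown up to the former.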