Hi Rahul,

Hudi's copy-on-write storage keeps expanding your existing parquet files
until they reach the configured file size, once you set the small file
size config.

For example, at Uber we write 1GB files this way. To do that, you could set
something like this:
http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024


Please let me know if you have trouble achieving this. Also, please use the
insert operation (not bulk_insert) for this to work.


Thanks
Vinoth

On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
[email protected]> wrote:

>
>
> On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
> > Hi Rahul,
> >
> > you can try adding hoodie.parquet.small.file.limit=104857600 to your
> > property file to specify 100MB files. Note that this works only if you
> > are using the insert (not bulk_insert) operation. Hudi will enforce file
> > sizing at ingest time. As of now, there is no support for collapsing
> > these file groups (parquet + related log files) into a larger file group
> > (a HIP/design may come soon). Does that help?
> >
> > Also, on compaction in general: since you don't have any updates,
> > I think you can simply use copy_on_write storage? Inserts will go to
> > the parquet file anyway on MOR. (But if you would like to be able to
> > deal with updates later, I understand where you are going.)
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> > [email protected]> wrote:
> >
> > > Dear All
> > >
> > > I am using DeltaStreamer to stream data from a Kafka topic and write
> > > it into a Hudi dataset.
> > > For this use case I am not doing any upserts; all operations are
> > > inserts only, so each job creates a new parquet file after the ingest
> > > job, and a large number of small files are being created. How can I
> > > merge these files from the DeltaStreamer job using the available
> > > configurations?
> > >
> > > I think compactionSmallFileSize may be useful for this case, but I am
> > > not sure whether it applies to DeltaStreamer or not. I tried it in
> > > DeltaStreamer but it didn't work. Please assist with this. If
> > > possible, give one example for the same.
> > >
> > > Thanks & Regards
> > > Rahul
> > >
> >
>
>
> Dear Vinoth
>
> For one of my use cases, I am doing only inserts. For testing, I am
> inserting data that has only 5-10 records, and I am continuously pushing
> data to the Hudi dataset. As it is insert only, every insert creates new
> small files in the dataset.
>
> If my insertion interval is short and I plan to keep the data for years,
> this flow will create lots of small files.
> I just want to know whether Hudi can merge these small files in any way.
>
>
> Thanks & Regards
> Rahul P
>
>
