Opened up https://github.com/uber/hudi/pull/599/files to improve this out of the box.
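For reference, the file sizing setup discussed downthread boils down to a
--props sketch like the following (key names are the ones Rahul uses later
in the thread; byte values written out: 1073741824 = 1 GB, 943718400 =
900 MB; this only takes effect with the insert operation, not bulk_insert):

    # target size Hudi grows each parquet file towards on copy-on-write
    hoodie.parquet.max.file.size=1073741824
    # any file below this size is a small-file candidate that the next
    # insert expands, instead of creating a new file group
    hoodie.parquet.small.file.limit=943718400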
On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar <[email protected]> wrote:

> Hi Rahul,
>
> The files you shared all belong to the same file group (they share the
> same prefix, if you notice)
> (https://hudi.apache.org/concepts.html#terminologies).
> Given that it is not creating new file groups every run, the feature is
> kicking in.
>
> During each insert, Hudi will find the latest file in each file group
> (i.e., the one with the largest instant time/timestamp) and
> rewrite/expand that with the new inserts. Hudi does not clean up the old
> files immediately, since that could cause running queries to fail; they
> could have started even hours ago (e.g., Hive).
>
> If you want to reduce the number of files you see, you can lower the
> number of commits retained:
> https://hudi.apache.org/configurations.html#retainCommits
> We retain 24 by default, i.e., after the 25th file, the first one will
> be automatically cleaned.
>
> Does that make sense? Are you able to query this data and find the
> expected records?
>
> Thanks
> Vinoth
>
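In --props form, the retention knob Vinoth mentions above would look
something like this sketch (assuming hoodie.cleaner.commits.retained is
the key behind the retainCommits doc entry; 24 matches the default he
quotes):

    # number of commits whose file versions the cleaner keeps around, so
    # that queries already running against older versions do not fail;
    # with 24 retained, the oldest version is cleaned on the 25th commit
    hoodie.cleaner.commits.retained=24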
> On Tue, Mar 12, 2019 at 12:23 PM [email protected] <
> [email protected]> wrote:
>
>> On 2019/03/11 18:25:46, Vinoth Chandar <[email protected]> wrote:
>> > Hi Rahul,
>> >
>> > Hudi/copy-on-write storage will keep expanding your existing parquet
>> > files to reach the configured file size, once you set the small file
>> > size config.
>> >
>> > For example, at Uber we write 1 GB files this way. To do that, you
>> > could set something like this:
>> > http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 *
>> > 1024 * 1024
>> > http://hudi.apache.org/configurations.html#compactionSmallFileSize =
>> > 900 * 1024 * 1024
>> >
>> > Please let me know if you have trouble achieving this. Also, please
>> > use the insert operation (not bulk_insert) for this to work.
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
>> > [email protected]> wrote:
>> >
>> > > On 2019/03/08 13:43:52, Vinoth Chandar <[email protected]> wrote:
>> > > > Hi Rahul,
>> > > >
>> > > > You can try adding hoodie.parquet.small.file.limit=104857600 to
>> > > > your property file to specify 100 MB files. Note that this works
>> > > > only if you are using the insert (not bulk_insert) operation.
>> > > > Hudi will enforce file sizing at ingest time. As of now, there is
>> > > > no support for collapsing these file groups (parquet + related
>> > > > log files) into a larger file group (a HIP/design may come soon).
>> > > > Does that help?
>> > > >
>> > > > Also, on compaction in general: since you don't have any updates,
>> > > > I think you can simply use the copy_on_write storage? Inserts
>> > > > will go to the parquet file anyway on MOR (but if you would like
>> > > > to be able to deal with updates later, I understand where you are
>> > > > going).
>> > > >
>> > > > Thanks
>> > > > Vinoth
>> > > >
>> > > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Dear All
>> > > > >
>> > > > > I am using DeltaStreamer to stream data from a Kafka topic and
>> > > > > write it into a Hudi dataset.
>> > > > > For this use case I am not doing any upserts; all operations
>> > > > > are inserts only, so each job creates a new parquet file after
>> > > > > the ingest job, and a large number of small files are being
>> > > > > created. How can I merge these files from the DeltaStreamer job
>> > > > > using the available configurations?
>> > > > >
>> > > > > I think compactionSmallFileSize may be useful for this case,
>> > > > > but I am not sure whether it applies to DeltaStreamer or not. I
>> > > > > tried it with DeltaStreamer but it didn't work. Please assist
>> > > > > on this. If possible, give one example for the same.
>> > > > >
>> > > > > Thanks & Regards
>> > > > > Rahul
>> > > >
>> > >
>> > > Dear Vinoth
>> > >
>> > > For one of my use cases, I am doing only inserts. For testing, I am
>> > > inserting data which has 5-10 records only. I am continuously
>> > > pushing data to the Hudi dataset. As it is insert-only, every
>> > > insert creates new small files in the dataset.
>> > >
>> > > If my insertion interval is short and I am planning to keep the
>> > > data for years, this flow will create lots of small files.
>> > > I just want to know whether Hudi can merge these small files in any
>> > > way.
>> > >
>> > > Thanks & Regards
>> > > Rahul P
>> >
>>
>> Dear Vinoth
>>
>> I tried the below configurations:
>>
>> hoodie.parquet.max.file.size=1073741824
>> hoodie.parquet.small.file.limit=943718400
>>
>> I am using the below command for inserting data from a JSON Kafka
>> source:
>>
>> spark-submit --class
>> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
>> hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class
>> com.uber.hoodie.utilities.sources.JsonKafkaSource
>> --source-ordering-field stype --target-base-path /MERGE --target-table
>> MERGE --props /hudi/kafka-source.properties --schemaprovider-class
>> com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op insert
>>
>> But each insert job creates a new parquet file; it does not touch the
>> old parquet files.
>>
>> For reference, I am sharing some of the parquet files of the Hudi
>> dataset which were generated as part of the DeltaStreamer data
>> insertion:
>>
>> 93      /MERGE/2019/03/06/.hoodie_partition_metadata
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
>> 424.0 K /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet
>>
>> Each job creates files of 424 K and none of them are merged. Can you
>> please confirm whether Hudi can achieve the use case I mentioned. If
>> this merging/compacting feature is there, kindly tell me what I am
>> missing here.
>>
>> Thanks & Regards
>> Rahul
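A quick way to sanity-check the file-group point against a listing like
Rahul's (a shell sketch assuming the HDFS layout shown above; the part of
the file name before the first underscore is the file group id, and the
trailing timestamp is the commit instant):

    # count versions per file group in one partition; a single group with
    # many timestamps means sizing is working and the older versions are
    # simply awaiting cleaning
    hdfs dfs -ls /MERGE/2019/03/06 | grep '\.parquet$' \
      | awk -F/ '{print $NF}' | cut -d_ -f1 | sort | uniq -c

Against the listing above this yields one file group id with a count of
11, i.e. eleven versions of a single file group, which is exactly the
behavior described at the top of the thread.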
