Hi Selva,

Was this the first insert? Hudi handles small files by converting some inserts into updates to existing small files. In this case, I see just one commit time, so there is nothing yet for Hudi to optimize. If you continue making updates/inserts over time, you should see these four files being expanded up to the configured limits, instead of new files being created.
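Concretely, the knobs at play are hoodie.parquet.small.file.limit (below which an existing file is considered "small" and a candidate for receiving new inserts) and hoodie.parquet.max.file.size (the size files are grown towards), which you already set, plus the insert split sizing that decides how many records go into each brand-new file on that very first commit. A rough sketch of just the size-related options, reusing your transDetailsDF/tableName/basePath with illustrative values (please double-check the exact keys against the configurations page for 0.5.0):

import org.apache.spark.sql.SaveMode.Append
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME

transDetailsDF.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
  option(RECORDKEY_FIELD_OPT_KEY, "record_key").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  // target file size, and the threshold below which an existing file counts as small
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  // sizing for new files written by inserts on the first commit; see
  // http://hudi.apache.org/docs/configurations.html#insertSplitSize
  option("hoodie.copyonwrite.insert.split.size", "500000").      // records per new insert bucket
  option("hoodie.copyonwrite.record.size.estimate", "1024").     // assumed avg record size in bytes
  mode(Append).
  save(basePath)

If I remember correctly, on the first insert there is no commit history to derive an average record size from, so these estimates are what drive the initial file sizes; on later commits Hudi uses the observed record size and grows the small files towards hoodie.parquet.max.file.size.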
Let me know if that helps. Also, another config to pay attention to for the first batch of inserts is
http://hudi.apache.org/docs/configurations.html#insertSplitSize

Thanks,
Vinoth

On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <[email protected]> wrote:

> Below are a few of the files:
>
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>
> On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <[email protected]> wrote:
>
> > Team,
> >
> > I am using Hudi 0.5.0. While writing a COW table with the code below, many small
> > files of ~15 MB each are getting created, whereas the total partition size is
> > 300 MB+.
> >
> > val output = transDetailsDF.write.format("org.apache.hudi").
> >   option("hoodie.insert.shuffle.parallelism", "2").
> >   option("hoodie.upsert.shuffle.parallelism", "2").
> >   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> >   option(OPERATION_OPT_KEY, "upsert").
> >   option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
> >   option(RECORDKEY_FIELD_OPT_KEY, "record_key").
> >   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >   option(TABLE_NAME, tableName).
> >   option("hoodie.datasource.write.payload.class", "org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >   option("hoodie.memory.merge.max.size", "2004857600000").
> >   option("hoodie.bloom.index.prune.by.ranges", "false").
> >   option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
> >   option("hoodie.cleaner.commits.retained", 2).
> >   option("hoodie.keep.min.commits", 3).
> >   option("hoodie.keep.max.commits", 5).
> >   option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
> >   option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
> >   mode(Append).
> >   save(basePath);
> >
> > As per the instructions provided in
> > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set compactionSmallFileSize
> > to 100 MB and limitFileSize to 128 MB.
> >
> > Hadoop block size is 256 MB, and I am expecting 128 MB files to be created.
> >
> > Am I missing any config here?
> >
> > Thanks,
> > Selva
