Hi Selva,

Was this the first insert? Hudi handles small files by converting some inserts into updates to existing small files. In this case, I see just one commit time, so there is nothing yet for Hudi to optimize. If you continue making updates/inserts over time, you should see these four files being expanded up to the configured limits, instead of new files being created.
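Concretely, the knobs at play are hoodie.parquet.small.file.limit (below which an existing file is considered "small" and a candidate for receiving new inserts) and hoodie.parquet.max.file.size (the size files are grown towards), which you already set, plus the insert split sizing that decides how many records go into each brand-new file on that very first commit. A rough sketch of just the size-related options, reusing your transDetailsDF/tableName/basePath with illustrative values (please double-check the exact keys against the configurations page for 0.5.0):

import org.apache.spark.sql.SaveMode.Append
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME

transDetailsDF.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
  option(RECORDKEY_FIELD_OPT_KEY, "record_key").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  // target file size, and the threshold below which an existing file counts as small
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  // sizing for new files written by inserts on the first commit; see
  // http://hudi.apache.org/docs/configurations.html#insertSplitSize
  option("hoodie.copyonwrite.insert.split.size", "500000").      // records per new insert bucket
  option("hoodie.copyonwrite.record.size.estimate", "1024").     // assumed avg record size in bytes
  mode(Append).
  save(basePath)

If I remember correctly, on the first insert there is no commit history to derive an average record size from, so these estimates are what drive the initial file sizes; on later commits Hudi uses the observed record size and grows the small files towards hoodie.parquet.max.file.size.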
Let me know if that helps. Also, another config to pay attention to for the first batch of inserts is
http://hudi.apache.org/docs/configurations.html#insertSplitSize

Thanks,
Vinoth

On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <[email protected]> wrote:

> Below are a few of the files:
>
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>
> On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <[email protected]> wrote:
>
> > Team,
> >
> > I am using Hudi 0.5.0. While writing a COW table with the code below, many small
> > files of ~15 MB each are getting created, whereas the total partition size is
> > 300 MB+.
> >
> > val output = transDetailsDF.write.format("org.apache.hudi").
> >   option("hoodie.insert.shuffle.parallelism", "2").
> >   option("hoodie.upsert.shuffle.parallelism", "2").
> >   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> >   option(OPERATION_OPT_KEY, "upsert").
> >   option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
> >   option(RECORDKEY_FIELD_OPT_KEY, "record_key").
> >   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >   option(TABLE_NAME, tableName).
> >   option("hoodie.datasource.write.payload.class", "org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >   option("hoodie.memory.merge.max.size", "2004857600000").
> >   option("hoodie.bloom.index.prune.by.ranges", "false").
> >   option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
> >   option("hoodie.cleaner.commits.retained", 2).
> >   option("hoodie.keep.min.commits", 3).
> >   option("hoodie.keep.max.commits", 5).
> >   option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
> >   option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
> >   mode(Append).
> >   save(basePath);
> >
> > As per the instructions provided in
> > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set compactionSmallFileSize
> > to 100 MB and limitFileSize to 128 MB.
> >
> > Hadoop block size is 256 MB, and I am expecting 128 MB files to be created.
> >
> > Am I missing any config here?
> >
> > Thanks,
> > Selva
