I then ran an update of 2000 records 4 times. The write code and the
resulting files are below.

  transDetailsDF1.write.format("org.apache.hudi").
          option("hoodie.insert.shuffle.parallelism", "2").
          option("hoodie.upsert.shuffle.parallelism", "2").
          option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
          option(OPERATION_OPT_KEY, "upsert").
          option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
          option(RECORDKEY_FIELD_OPT_KEY,"record_key").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
          option(TABLE_NAME, tableName).
          option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
          option("hoodie.memory.merge.max.size", "2004857600000").
          option("hoodie.bloom.index.prune.by.ranges","false").
          option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
          option("hoodie.cleaner.commits.retained",1).
          option("hoodie.keep.min.commits",2).
          option("hoodie.keep.max.commits",3).
          option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
          option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
          option("hoodie.copyonwrite.insert.split.size","2650000").
          mode(Append).
          save(basePath);

Found 67 items
-rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet
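
To make the listing above easier to read, here is a minimal sketch that groups
the parquet files by file id and lists the commit times kept for each one. It
assumes the file names follow the layout <fileId>_<writeToken>_<commitTime>.parquet,
which is what the listing appears to show. FileGroups, BaseFile, parse and
versionsByFileId are names made up for this illustration, not Hudi APIs.

```scala
// Hypothetical helper, not part of Hudi. Assumes base file names look like
// <fileId>_<writeToken>_<commitTime>.parquet, as in the listing above.
object FileGroups {
  final case class BaseFile(fileId: String, writeToken: String, commitTime: String)

  // Parse one HDFS path; returns None for anything that is not a base parquet file
  // (for example .hoodie_partition_metadata).
  def parse(path: String): Option[BaseFile] = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    if (!name.endsWith(".parquet")) None
    else name.stripSuffix(".parquet").split("_") match {
      case Array(fileId, token, commit) => Some(BaseFile(fileId, token, commit))
      case _                            => None
    }
  }

  // Sorted commit times per file id: one entry per version still on disk.
  def versionsByFileId(paths: Seq[String]): Map[String, Seq[String]] =
    paths.flatMap(parse).groupBy(_.fileId).map {
      case (id, files) => id -> files.map(_.commitTime).sorted
    }

  def main(args: Array[String]): Unit = {
    val dir = "/projects/transaction_details_hourly_hudi/20191201/11/"
    // Three entries copied from the listing above, plus the metadata file.
    val sample = Seq(
      dir + ".hoodie_partition_metadata",
      dir + "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet",
      dir + "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet",
      dir + "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet"
    )
    versionsByFileId(sample).foreach { case (id, commits) =>
      println(s"$id -> ${commits.mkString(", ")}")
    }
  }
}
```

Read this way, the 66 parquet files look like 22 file ids with 3 versions each
(commit times 20200316074213, 20200316074336 and 20200316074511), so the updates
seem to be rewriting the same files rather than adding new ones.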

Thanks,
Selva
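
A rough sizing check on the numbers in this thread, in case it helps anyone
reading along. It assumes that on a first insert Hudi sizes each new file as
hoodie.parquet.max.file.size divided by an estimated record size, and that the
estimate (hoodie.copyonwrite.record.size.estimate) is at what I believe is its
default of 1024 bytes. Neither assumption is confirmed in this thread.
InsertSizing and its function names are made up for the illustration; the
record count is the 2765125 from the quoted message below.

```scala
// Sizing sketch. The bucket formula and the 1024-byte estimate are ASSUMPTIONS
// about Hudi 0.5.0 defaults, not something confirmed in this thread.
object InsertSizing {
  // hoodie.parquet.max.file.size as set in the write config (128 MB).
  val maxFileBytes: Long = 128L * 1024 * 1024
  // Assumed default of hoodie.copyonwrite.record.size.estimate.
  val assumedRecordBytes: Long = 1024L

  // Records placed in one new file under the assumed formula.
  def recordsPerBucket(fileBytes: Long, recordBytes: Long): Long =
    fileBytes / recordBytes

  // Number of files needed for a record count, rounded up.
  def bucketsNeeded(totalRecords: Long, perBucket: Long): Long =
    (totalRecords + perBucket - 1) / perBucket

  def main(args: Array[String]): Unit = {
    val totalRecords = 2765125L // expected records in the partition
    val perBucket = recordsPerBucket(maxFileBytes, assumedRecordBytes)
    println(s"records per file: $perBucket")
    println(s"files needed: ${bucketsNeeded(totalRecords, perBucket)}")
  }
}
```

Under those assumptions this works out to 131072 records per file and 22 files,
which lines up with the 22 parquet files in the first listing. If that reading
is right, the real records are much smaller than the estimate, which would
explain files of about 15 MB instead of 128 MB.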

On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
[email protected]> wrote:

> Hi Vinoth,
>
> I tried multiple runs. The total records expected in the
> partition is 2765125. Below is the spark-shell command.
>
> spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
>  yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> --executor-memory 40g  --num-executors 5 --executor-cores 5 --conf
> 'spark.executor.memoryOverhead=2048' --conf
> 'spark.dynamicAllocation.enabled=false' --conf
> 'spark.sql.hive.convertMetastoreParquet=false' --conf
> 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m' --conf
> 'spark.shuffle.service.enabled=true'
>
> Dynamic allocation set to false
>
> Attempt 1 -> Ran with mode Overwrite and OPERATION_OPT_KEY set to insert.
> The code is below.
>
>   transDetailsDF1.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "5").
>           option("hoodie.upsert.shuffle.parallelism", "5").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "insert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained",1).
>           option("hoodie.keep.min.commits",2).
>           option("hoodie.keep.max.commits",3).
>           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>           option("hoodie.copyonwrite.insert.split.size","2650000").
>           mode(Overwrite).
>           save(basePath);
>
>
>
>
> Below are the files in HDFS .
>
> Found 23 items
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
>
>
>
> Attempt 2 -> Updated 10 records with mode Append and OPERATION_OPT_KEY set to upsert.
>
>
>   transDetailsDF1.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "5").
>           option("hoodie.upsert.shuffle.parallelism", "5").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "upsert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained",1).
>           option("hoodie.keep.min.commits",2).
>           option("hoodie.keep.max.commits",3).
>           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>           option("hoodie.copyonwrite.insert.split.size","2650000").
>           mode(Append).
>           save(basePath);
>
>
>
> Found 31 items
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
>
>
>
>
>
>
>
> In both cases, the file sizes are around 15 MB.
>
>
> Thanks,
>
> Selva
>
>
>
>
>
>
>
>
> On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <[email protected]> wrote:
>
>> Hi Selva,
>>
>> Was this the first insert? Hudi handles small files by converting some
>> inserts into updates to existing files. In this case, I see just one commit
>> time, so there is nothing Hudi could optimize for.
>> If you continue making updates/inserts over time, you should see these four
>> files being expanded up to the configured limits, instead of new files being
>> created.
>>
>> Let me know if that helps. Another config to pay attention to, for the
>> first batch of inserts, is
>> http://hudi.apache.org/docs/configurations.html#insertSplitSize
>>
>> Thanks
>> Vinoth
>>
>> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
>> [email protected]> wrote:
>>
>> > Below are a few of the files.
>> >
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>> > /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>> > /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>> > /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>> > /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>> >
>> >
>> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
>> > [email protected]> wrote:
>> >
>> > > Team,
>> > >
>> > > I am using Hudi 0.5.0. While writing a COW table with the code below, many
>> > > small files of about 15 MB each are created, whereas the total partition
>> > > size is 300 MB+.
>> > >
>> > >   val output = transDetailsDF.write.format("org.apache.hudi").
>> > >           option("hoodie.insert.shuffle.parallelism", "2").
>> > >           option("hoodie.upsert.shuffle.parallelism", "2").
>> > >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>> > >           option(OPERATION_OPT_KEY, "upsert").
>> > >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>> > >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>> > >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>> > >           option(TABLE_NAME, tableName).
>> > >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>> > >           option("hoodie.memory.merge.max.size", "2004857600000").
>> > >           option("hoodie.bloom.index.prune.by.ranges","false").
>> > >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>> > >           option("hoodie.cleaner.commits.retained", 2).
>> > >           option("hoodie.keep.min.commits",3).
>> > >           option("hoodie.keep.max.commits",5).
>> > >           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>> > >           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>> > >           mode(Append).
>> > >           save(basePath);
>> > >
>> > > As per the instructions in
>> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
>> > > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
>> > >
>> > > The Hadoop block size is 256 MB, and I am expecting 128 MB files to be
>> > > created.
>> > >
>> > > Am I missing any config here?
>> > >
>> > > Thanks,
>> > > Selva
>> > >
>> >
>>
>
