Hi Selva,
Hudi has a CLI that summarizes each commit nicely. Can you also
provide the output from that? It will tell you how many files were
created, updated, etc.
http://hudi.apache.org/docs/deployment.html#inspecting-commits
The 2765125 records in the initial batch are getting split into 2.7M/500K
partitions during writing (to get parallel write performance), as per the
config I pointed out before. However, that is not as high as 20, the number
of files you are getting. Can you share the driver logs around the
statement below for the initial commit (HoodieCopyOnWrite#UpsertPartitioner
is what we want)? We can open a GitHub issue if that makes it easier to
share logs, code, etc.
LOG.info("Total Buckets :" + totalBuckets + ", buckets info => " +
bucketInfoMap + ", \n"
+ "Partition to insert buckets => " + partitionPathToInsertBuckets + ", \n"
+ "UpdateLocations mapped to buckets =>" + updateLocationToBucket);
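For reference, here is a rough sketch of the split arithmetic mentioned above. It assumes the number of insert buckets is simply the record count divided by the insert split size, rounded up. The real partitioner also looks at small files and average record size, so treat this only as an approximation. InsertSplitEstimate and estimateBuckets are made-up names, not Hudi APIs.

```java
// Hypothetical helper, not part of Hudi: approximates how many insert
// buckets a batch is split into for a given insert split size.
final class InsertSplitEstimate {

    // Ceiling division: number of buckets needed to hold totalRecords
    // when each bucket takes at most splitSize records.
    static long estimateBuckets(long totalRecords, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        return (totalRecords + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long records = 2765125L; // record count reported in this thread
        System.out.println(estimateBuckets(records, 500000L));  // 500K split size: prints 6
        System.out.println(estimateBuckets(records, 2650000L)); // split size from the write options: prints 2
    }
}
```

Either way, the estimate is well below the roughly 20 files in the listing, which is why the UpsertPartitioner log line is worth looking at.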
Aside from that, I sampled a file ID in the later commits, and it does seem
like it's getting rewritten as expected. So if we understand why you have 20
files to begin with, we can go from there.
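To illustrate the file ID check, here is a minimal sketch that groups data file names by file ID. It assumes the <fileId>_<writeToken>_<commitTime>.parquet naming seen in the listings in this thread. FileVersions and commitsByFileId are hypothetical names, not Hudi APIs, and the sample names are copied from the listing below.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Hudi: groups data file names by file ID
// so that rewritten versions of the same file group can be spotted.
final class FileVersions {

    // Assumes names of the form "<fileId>_<writeToken>_<commitTime>.parquet".
    // Anything else (e.g. .hoodie_partition_metadata) is skipped.
    static Map<String, TreeSet<String>> commitsByFileId(List<String> fileNames) {
        Map<String, TreeSet<String>> result = new TreeMap<>();
        for (String name : fileNames) {
            if (!name.endsWith(".parquet")) {
                continue;
            }
            String base = name.substring(0, name.length() - ".parquet".length());
            String[] parts = base.split("_");
            if (parts.length != 3) {
                continue;
            }
            // parts[0] = file ID, parts[2] = commit time
            result.computeIfAbsent(parts[0], k -> new TreeSet<>()).add(parts[2]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            ".hoodie_partition_metadata",
            "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet",
            "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet",
            "29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet",
            "e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet");
        commitsByFileId(names).forEach(
            (fileId, commits) -> System.out.println(fileId + " -> " + commits));
    }
}
```

A file ID that shows up under several commit times is being rewritten by the upserts rather than spawning new files, which is the expected copy-on-write behavior.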
On Mon, Mar 16, 2020 at 12:48 AM selvaraj periyasamy <
[email protected]> wrote:
> And then I ran updates for 2000 records 4 times, and below are the
> files.
>
> transDetailsDF1.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "2").
> option("hoodie.upsert.shuffle.parallelism", "2").
> option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> option(OPERATION_OPT_KEY, "upsert").
> option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> option(TABLE_NAME, tableName).
>
>
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> option("hoodie.memory.merge.max.size", "2004857600000").
> option("hoodie.bloom.index.prune.by.ranges","false").
> option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> option("hoodie.cleaner.commits.retained",1).
> option("hoodie.keep.min.commits",2).
> option("hoodie.keep.max.commits",3).
>
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> option("hoodie.copyonwrite.insert.split.size","2650000").
> mode(Append).
> save(basePath);
>
> Found 67 items
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
>
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet
>
> Thanks,
> Selva
>
> On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
> [email protected]> wrote:
>
> > Hi Vinoth,
> >
> > I tried multiple runs. The total number of records expected in the
> > partition is 2765125. Below is the spark-shell command.
> >
> > spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> > 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
> > yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> > --executor-memory 40g --num-executors 5 --executor-cores 5 --conf
> > 'spark.executor.memoryOverhead=2048' --conf
> > 'spark.dynamicAllocation.enabled=false' --conf
> > 'spark.sql.hive.convertMetastoreParquet=false' --conf
> > 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m'
> > --conf 'spark.shuffle.service.enabled=true'
> >
> > Dynamic allocation set to false
> >
> > Attempt 1 -> Ran with mode Overwrite and OPERATION_OPT_KEY set to insert.
> > Below is the code.
> >
> > transDetailsDF1.write.format("org.apache.hudi").
> >
> > option("hoodie.insert.shuffle.parallelism", "5").
> >
> > option("hoodie.upsert.shuffle.parallelism", "5").
> >
> > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >
> > option(OPERATION_OPT_KEY, "insert").
> >
> > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >
> > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >
> > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >
> > option(TABLE_NAME, tableName).
> >
> >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >
> > option("hoodie.memory.merge.max.size", "2004857600000").
> >
> > option("hoodie.bloom.index.prune.by.ranges","false").
> >
> > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >
> > option("hoodie.cleaner.commits.retained",1).
> >
> > option("hoodie.keep.min.commits",2).
> >
> > option("hoodie.keep.max.commits",3).
> >
> >
> >
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >
> >
> >
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >
> > option("hoodie.copyonwrite.insert.split.size","2650000").
> >
> > mode(Overwrite).
> >
> > save(basePath);
> >
> >
> >
> >
> > Below are the files in HDFS .
> >
> > Found 23 items
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> >
> >
> >
> > Attempt 2 -> Updated 10 records with Append mode and upsert key
> >
> >
> > transDetailsDF1.write.format("org.apache.hudi").
> >
> > option("hoodie.insert.shuffle.parallelism", "5").
> >
> > option("hoodie.upsert.shuffle.parallelism", "5").
> >
> > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >
> > option(OPERATION_OPT_KEY, "upsert").
> >
> > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >
> > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >
> > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >
> > option(TABLE_NAME, tableName).
> >
> >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >
> > option("hoodie.memory.merge.max.size", "2004857600000").
> >
> > option("hoodie.bloom.index.prune.by.ranges","false").
> >
> > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >
> > option("hoodie.cleaner.commits.retained",1).
> >
> > option("hoodie.keep.min.commits",2).
> >
> > option("hoodie.keep.max.commits",3).
> >
> >
> >
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >
> >
> >
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >
> > option("hoodie.copyonwrite.insert.split.size","2650000").
> >
> > mode(Append).
> >
> > save(basePath);
> >
> >
> >
> > Found 31 items
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
> >
> >
> >
> >
> >
> >
> >
> > In both cases, the file sizes are around 15 MB.
> >
> >
> > Thanks,
> >
> > Selva
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <[email protected]>
> wrote:
> >
> >> Hi Selva,
> >>
> >> Was this the first insert? Hudi handles small files by converting some
> >> inserts into updates to existing files. In this case, I see just one commit
> >> time, so there is nothing Hudi could optimize for.
> >> If you continue making updates/inserts over time, you should see these four
> >> files being expanded up to the configured limits, instead of new files being
> >> created.
> >>
> >> Let me know if that helps. Another config to pay attention to, in the
> >> case of the first batch of inserts, is
> >> http://hudi.apache.org/docs/configurations.html#insertSplitSize
> >>
> >> Thanks
> >> Vinoth
> >>
> >> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
> >> [email protected]> wrote:
> >>
> >> > Below are a few of the files.
> >> >
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
> >> >
> >> >
> >> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> >> > [email protected]> wrote:
> >> >
> >> > > Team,
> >> > >
> >> > > I am using Hudi 0.5.0. While writing a COW table with the code below, many
> >> > > small files of about 15 MB each are getting created, whereas the total
> >> > > partition size is 300 MB+.
> >> > >
> >> > > val output = transDetailsDF.write.format("org.apache.hudi").
> >> > > option("hoodie.insert.shuffle.parallelism", "2").
> >> > > option("hoodie.upsert.shuffle.parallelism", "2").
> >> > > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >> > > option(OPERATION_OPT_KEY, "upsert").
> >> > > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >> > > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >> > > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >> > > option(TABLE_NAME, tableName).
> >> > > option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >> > > option("hoodie.memory.merge.max.size", "2004857600000").
> >> > > option("hoodie.bloom.index.prune.by.ranges","false").
> >> > > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >> > > option("hoodie.cleaner.commits.retained", 2).
> >> > > option("hoodie.keep.min.commits",3).
> >> > > option("hoodie.keep.max.commits",5).
> >> > > option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >> > > option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >> > > mode(Append).
> >> > > save(basePath);
> >> > > As per the instructions provided in
> >> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> >> > > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
> >> > >
> >> > > The Hadoop block size is 256 MB; I am expecting 128 MB files to be
> >> > > created.
> >> > >
> >> > > Am I missing any config here?
> >> > >
> >> > > Thanks,
> >> > > Selva
> >> > >
> >> >
> >>
> >
>