selvarajperiyasamy opened a new issue #1583:
URL: https://github.com/apache/incubator-hudi/issues/1583
I use Hudi 0.5.0. While writing a COW table with the code below, many small files of about 18 MB each are created, whereas the total partition size is 100 MB+.
Below are three attempts with different configs added. Before each attempt, I wipe out the target folder to make sure nothing left over from an existing write is causing the issue:
```
hdfs dfs -rmr /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/*
```
**Attempt 1:**
```scala
val responseDF = replicateDF.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "10").
  option("hoodie.upsert.shuffle.parallelism", "10").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on Attunity config
  option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned per table size
  option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
  option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
  option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
  option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
  option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
  option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
  option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
  option(TABLE_NAME, jobConfig.targetTable).
  mode(Append).
  save(jobConfig.targetLocation)
```
```
20/05/01 21:17:09 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets info =>
{0=BucketInfo {bucketType=INSERT, fileIdPrefix=86f26fea-329e-45e1-95b5-00564d6123c7},
 1=BucketInfo {bucketType=INSERT, fileIdPrefix=552da52a-68a7-4211-b705-7546d8987bdd},
 2=BucketInfo {bucketType=INSERT, fileIdPrefix=cc9a865d-e13a-4ca6-8af3-b706f6a2c963},
 3=BucketInfo {bucketType=INSERT, fileIdPrefix=10b3d474-53c4-4f5f-b061-f33bc2b415f6},
 4=BucketInfo {bucketType=INSERT, fileIdPrefix=ccad53f2-9adf-40a9-bd73-0ab5b4137234},
 5=BucketInfo {bucketType=INSERT, fileIdPrefix=10c7ab6a-9d65-4856-983b-438c7e3bbf44},
 6=BucketInfo {bucketType=INSERT, fileIdPrefix=65e1c010-489d-49e0-8c24-84ecb062db1d},
 7=BucketInfo {bucketType=INSERT, fileIdPrefix=1d9a337a-e1db-45ad-af38-039d6359c70e}},
Partition to insert buckets =>
{default=[WorkloadStat {bucketNumber=0, weight=1.0}],
 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}],
 20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}],
 20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=4, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=5, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=6, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=7, weight=0.19999967170112984}]},
```
```
[@ ~]$ hdfs dfs -ls -h /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
Found 6 items
-rw-r--r--   3 Hadoop_cdp      93 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/10b3d474-53c4-4f5f-b061-f33bc2b415f6-0_3-60-138_20200501211622.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/10c7ab6a-9d65-4856-983b-438c7e3bbf44-0_5-60-140_20200501211622.parquet
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1d9a337a-e1db-45ad-af38-039d6359c70e-0_7-60-142_20200501211622.parquet
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/65e1c010-489d-49e0-8c24-84ecb062db1d-0_6-60-141_20200501211622.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 21:17 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/ccad53f2-9adf-40a9-bd73-0ab5b4137234-0_4-60-139_20200501211622.parquet
```
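As an aside on reading the log: the five equal weights of ~0.2 for partition 20200117/15 mean the inserts for that partition were split evenly across five insert buckets, and each insert bucket becomes its own base file, which matches the five ~18.5-18.6 MB parquet files listed. A minimal sketch of that even-split arithmetic, with purely hypothetical record counts (not taken from this job):

```scala
// Illustrative only: how splitting one partition's inserts into fixed-size
// buckets yields equal per-bucket weights like the ~0.2 values in the log.
def splitWeights(totalRecords: Long, recordsPerBucket: Long): Seq[Double] = {
  val fullBuckets = (totalRecords / recordsPerBucket).toInt
  val remainder   = totalRecords % recordsPerBucket
  val weights     = Seq.fill(fullBuckets)(recordsPerBucket.toDouble / totalRecords)
  if (remainder > 0) weights :+ (remainder.toDouble / totalRecords) else weights
}

// Hypothetical: 1,000,000 inserts split into buckets of 200,000 records each.
println(splitWeights(1000000L, 200000L)) // List(0.2, 0.2, 0.2, 0.2, 0.2)
```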
**Attempt 2:**
```scala
val responseDF = replicateDF.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "10").
  option("hoodie.upsert.shuffle.parallelism", "10").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on Attunity config
  option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned per table size
  option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
  option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
  option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
  option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
  option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
  option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
  option("hoodie.copyonwrite.insert.split.size", "1000000").
  option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
  option(TABLE_NAME, jobConfig.targetTable).
  mode(Append).
  save(jobConfig.targetLocation)
```
```
20/05/01 22:05:27 INFO HoodieCopyOnWriteTable: Total insert buckets for partition path 20200117/15 =>
[WorkloadStat {bucketNumber=3, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=4, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=5, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=6, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=7, weight=0.19999967170112984}]
20/05/01 22:05:27 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets info =>
{0=BucketInfo {bucketType=INSERT, fileIdPrefix=f998df6d-51e3-4a05-a024-304176ce558f},
 1=BucketInfo {bucketType=INSERT, fileIdPrefix=d77105ff-9511-41f9-98d9-151f9336a87f},
 2=BucketInfo {bucketType=INSERT, fileIdPrefix=d50b4048-559f-42d2-a87b-7a91f59199a2},
 3=BucketInfo {bucketType=INSERT, fileIdPrefix=1638a428-fcbb-463f-a9d8-c1e9a07995fe},
 4=BucketInfo {bucketType=INSERT, fileIdPrefix=5ebcf976-bdad-4b91-a3ee-43f2400a26b1},
 5=BucketInfo {bucketType=INSERT, fileIdPrefix=23e9a449-fc09-4690-bd07-b7fdd1986b53},
 6=BucketInfo {bucketType=INSERT, fileIdPrefix=edef2a22-752c-4360-ae66-e29049068820},
 7=BucketInfo {bucketType=INSERT, fileIdPrefix=1e31bb2a-e155-4535-abee-45885da67636}},
Partition to insert buckets =>
{default=[WorkloadStat {bucketNumber=0, weight=1.0}],
 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}],
 20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}],
 20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=4, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=5, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=6, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=7, weight=0.19999967170112984}]},
UpdateLocations mapped to buckets => {}
```
```
[@ ~]$ hdfs dfs -ls -h /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
Found 6 items
-rw-r--r--   3 Hadoop_cdp      93 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1638a428-fcbb-463f-a9d8-c1e9a07995fe-0_3-61-153_20200501220448.parquet
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1e31bb2a-e155-4535-abee-45885da67636-0_7-61-157_20200501220448.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/23e9a449-fc09-4690-bd07-b7fdd1986b53-0_5-61-155_20200501220448.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/5ebcf976-bdad-4b91-a3ee-43f2400a26b1-0_4-61-154_20200501220448.parquet
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 22:05 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/edef2a22-752c-4360-ae66-e29049068820-0_6-61-156_20200501220448.parquet
```
**Attempt 3:**
```scala
val responseDF = replicateDF.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "10").
  option("hoodie.upsert.shuffle.parallelism", "10").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on Attunity config
  option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned per table size
  option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
  option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
  option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
  option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
  option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
  option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
  option("hoodie.copyonwrite.insert.split.size", "1000000").
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
  option(TABLE_NAME, jobConfig.targetTable).
  mode(Append).
  save(jobConfig.targetLocation)
```
```
20/05/01 22:23:01 INFO HoodieCopyOnWriteTable: Total insert buckets for partition path 20200117/15 =>
[WorkloadStat {bucketNumber=3, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=4, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=5, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=6, weight=0.19999967170112984},
 WorkloadStat {bucketNumber=7, weight=0.19999967170112984}]
20/05/01 22:23:01 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets info =>
{0=BucketInfo {bucketType=INSERT, fileIdPrefix=ef6c9b89-c73e-4bf8-b410-9f1f53051e03},
 1=BucketInfo {bucketType=INSERT, fileIdPrefix=73b570a6-40a1-4544-939a-487a32e1da9d},
 2=BucketInfo {bucketType=INSERT, fileIdPrefix=958064d0-8626-4f21-b0d6-021e907c6244},
 3=BucketInfo {bucketType=INSERT, fileIdPrefix=a2914ba2-79f1-4d49-8c70-6bfd9b1aa2c9},
 4=BucketInfo {bucketType=INSERT, fileIdPrefix=f9da5788-66af-4596-8b66-15e894ec35f7},
 5=BucketInfo {bucketType=INSERT, fileIdPrefix=b27a4860-d3d8-4dea-89aa-6a49a99d483a},
 6=BucketInfo {bucketType=INSERT, fileIdPrefix=5db8e488-7d75-4030-b263-7d9c46068415},
 7=BucketInfo {bucketType=INSERT, fileIdPrefix=0f739c75-71ed-462a-999d-f4c68d0331a0}},
Partition to insert buckets =>
{default=[WorkloadStat {bucketNumber=0, weight=1.0}],
 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}],
 20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}],
 20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=4, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=5, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=6, weight=0.19999967170112984},
              WorkloadStat {bucketNumber=7, weight=0.19999967170112984}]},
UpdateLocations mapped to buckets => {}
```
```
[@ ~]$ hdfs dfs -ls -h /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
Found 6 items
-rw-r--r--   3 Hadoop_cdp      93 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/0f739c75-71ed-462a-999d-f4c68d0331a0-0_7-60-142_20200501222215.parquet
-rw-r--r--   3 Hadoop_cdp  18.5 M 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/5db8e488-7d75-4030-b263-7d9c46068415-0_6-60-141_20200501222215.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/a2914ba2-79f1-4d49-8c70-6bfd9b1aa2c9-0_3-60-138_20200501222215.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/b27a4860-d3d8-4dea-89aa-6a49a99d483a-0_5-60-140_20200501222215.parquet
-rw-r--r--   3 Hadoop_cdp  18.6 M 2020-05-01 22:23 /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/f9da5788-66af-4596-8b66-15e894ec35f7-0_4-60-139_20200501222215.parquet
```
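One more experiment that may be worth trying, sketched below as a hedged guess rather than a confirmed fix: since each attempt starts from an empty target folder, this is always the table's first commit, and Hudi then has no previous commit stats to estimate record size from, so file sizing is driven by the insert-split settings rather than by `hoodie.parquet.max.file.size` alone. The `hoodie.copyonwrite.record.size.estimate` key exists in Hudi 0.5.x; the value below is purely illustrative and would need to be measured for the real table:

```scala
// Hedged sketch only, not a confirmed fix: same writer as Attempt 3, with a
// record-size estimate added. The value 64 (bytes per record) is illustrative.
val responseDF = replicateDF.write.format("org.apache.hudi").
  // ...same options as in the attempts above, elided for brevity...
  option("hoodie.copyonwrite.insert.split.size", "1000000").
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  option("hoodie.copyonwrite.record.size.estimate", "64"). // illustrative value
  option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
  option(TABLE_NAME, jobConfig.targetTable).
  mode(Append).
  save(jobConfig.targetLocation)
```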
Thanks,
Selva