selvarajperiyasamy opened a new issue #1583:
URL: https://github.com/apache/incubator-hudi/issues/1583


   
   I am using Hudi 0.5.0. While writing a COPY_ON_WRITE table with the code below, many small files of ~18 MB each are created, even though the total partition size is 100 MB+.
   
   Below are 3 attempts with different configs added. Before each attempt, I wipe out the target folder to make sure nothing left over from a previous write is causing the issue:
   
    
   
   hdfs dfs -rmr /projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/*
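   For context, this is the sizing I expected per partition (a back-of-the-envelope sketch; the 120 MB figure is my understanding of the default hoodie.parquet.max.file.size, so correct me if that is off):
   
     // Rough expectation, not taken from the job itself
     val partitionBytes   = 100L * 1024 * 1024                  // partition is roughly 100 MB+
     val maxFileSizeBytes = 120L * 1024 * 1024                  // assumed default hoodie.parquet.max.file.size
     val expectedFiles    = math.ceil(partitionBytes.toDouble / maxFileSizeBytes).toInt
     // expectedFiles == 1, yet every attempt below produces 5 parquet files of ~18.5 MB each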
   
    
   
    
   
    
   
   **Attempt 1:**
   
    
   
   val responseDF = replicateDF.write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "10").
     option("hoodie.upsert.shuffle.parallelism", "10").
     option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on attunity config
     option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned as per table size
     option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
     option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
     option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
     option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
     option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
     option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
     option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
     option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
     option(TABLE_NAME, jobConfig.targetTable).
     mode(Append).
     save(jobConfig.targetLocation)
   
    
   
   20/05/01 21:17:09 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets 
info => {0=BucketInfo {bucketType=INSERT, 
fileIdPrefix=86f26fea-329e-45e1-95b5-00564d6123c7}, 1=BucketInfo 
{bucketType=INSERT, fileIdPrefix=552da52a-68a7-4211-b705-7546d8987bdd}, 
2=BucketInfo {bucketType=INSERT, 
fileIdPrefix=cc9a865d-e13a-4ca6-8af3-b706f6a2c963}, 3=BucketInfo 
{bucketType=INSERT, fileIdPrefix=10b3d474-53c4-4f5f-b061-f33bc2b415f6}, 
4=BucketInfo {bucketType=INSERT, 
fileIdPrefix=ccad53f2-9adf-40a9-bd73-0ab5b4137234}, 5=BucketInfo 
{bucketType=INSERT, fileIdPrefix=10c7ab6a-9d65-4856-983b-438c7e3bbf44}, 
6=BucketInfo {bucketType=INSERT, 
fileIdPrefix=65e1c010-489d-49e0-8c24-84ecb062db1d}, 7=BucketInfo 
{bucketType=INSERT, fileIdPrefix=1d9a337a-e1db-45ad-af38-039d6359c70e}},
   
   Partition to insert buckets => {default=[WorkloadStat {bucketNumber=0, 
weight=1.0}], 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}], 
20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}], 
20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984}, 
WorkloadStat {bucketNumber=4, weight=0.19999967170112984}, WorkloadStat 
{bucketNumber=5, weight=0.19999967170112984}, WorkloadStat {bucketNumber=6, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=7, 
weight=0.19999967170112984}]},
   
    
   
   [@ ~]$ hdfs dfs -ls -h 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
   
   Found 6 items
   
   -rw-r--r--   3  Hadoop_cdp         93 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/10b3d474-53c4-4f5f-b061-f33bc2b415f6-0_3-60-138_20200501211622.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/10c7ab6a-9d65-4856-983b-438c7e3bbf44-0_5-60-140_20200501211622.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1d9a337a-e1db-45ad-af38-039d6359c70e-0_7-60-142_20200501211622.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/65e1c010-489d-49e0-8c24-84ecb062db1d-0_6-60-141_20200501211622.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 21:17 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/ccad53f2-9adf-40a9-bd73-0ab5b4137234-0_4-60-139_20200501211622.parquet
   
    
   
    
   
   **Attempt 2:**
   
    
   
   val responseDF = replicateDF.write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "10").
     option("hoodie.upsert.shuffle.parallelism", "10").
     option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on attunity config
     option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned as per table size
     option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
     option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
     option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
     option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
     option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
     option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
     option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
     option("hoodie.copyonwrite.insert.split.size", "1000000").
     option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
     option(TABLE_NAME, jobConfig.targetTable).
     mode(Append).
     save(jobConfig.targetLocation)
   
    
   
    
   
    
   
   20/05/01 22:05:27 INFO HoodieCopyOnWriteTable: Total insert buckets for 
partition path 20200117/15 => [WorkloadStat {bucketNumber=3, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=4, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=5, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=6, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=7, 
weight=0.19999967170112984}]
   
   20/05/01 22:05:27 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets 
info => {0=BucketInfo {bucketType=INSERT, 
fileIdPrefix=f998df6d-51e3-4a05-a024-304176ce558f}, 1=BucketInfo 
{bucketType=INSERT, fileIdPrefix=d77105ff-9511-41f9-98d9-151f9336a87f}, 
2=BucketInfo {bucketType=INSERT, 
fileIdPrefix=d50b4048-559f-42d2-a87b-7a91f59199a2}, 3=BucketInfo 
{bucketType=INSERT, fileIdPrefix=1638a428-fcbb-463f-a9d8-c1e9a07995fe}, 
4=BucketInfo {bucketType=INSERT, 
fileIdPrefix=5ebcf976-bdad-4b91-a3ee-43f2400a26b1}, 5=BucketInfo 
{bucketType=INSERT, fileIdPrefix=23e9a449-fc09-4690-bd07-b7fdd1986b53}, 
6=BucketInfo {bucketType=INSERT, 
fileIdPrefix=edef2a22-752c-4360-ae66-e29049068820}, 7=BucketInfo 
{bucketType=INSERT, fileIdPrefix=1e31bb2a-e155-4535-abee-45885da67636}},
   
   Partition to insert buckets => {default=[WorkloadStat {bucketNumber=0, 
weight=1.0}], 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}], 
20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}], 
20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984}, 
WorkloadStat {bucketNumber=4, weight=0.19999967170112984}, WorkloadStat 
{bucketNumber=5, weight=0.19999967170112984}, WorkloadStat {bucketNumber=6, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=7, 
weight=0.19999967170112984}]},
   
   UpdateLocations mapped to buckets =>{}
   
    
   
    
   
   [@ ~]$ hdfs dfs -ls -h 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
   
   Found 6 items
   
   -rw-r--r--   3  Hadoop_cdp         93 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1638a428-fcbb-463f-a9d8-c1e9a07995fe-0_3-61-153_20200501220448.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/1e31bb2a-e155-4535-abee-45885da67636-0_7-61-157_20200501220448.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/23e9a449-fc09-4690-bd07-b7fdd1986b53-0_5-61-155_20200501220448.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/5ebcf976-bdad-4b91-a3ee-43f2400a26b1-0_4-61-154_20200501220448.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 22:05 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/edef2a22-752c-4360-ae66-e29049068820-0_6-61-156_20200501220448.parquet
   
    
   
    
   
   **Attempt 3:**
   
    
   
   val responseDF = replicateDF.write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "10").
     option("hoodie.upsert.shuffle.parallelism", "10").
     option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(PRECOMBINE_FIELD_OPT_KEY, "oracle_commit_ts"). // TODO: need to change this column based on attunity config
     option("hoodie.memory.merge.max.size", "2004857600000"). // TODO: this one needs to be tuned as per table size
     option(PARTITIONPATH_FIELD_OPT_KEY, jobConfig.partitionDerivation).
     option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
     option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
     option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
     option("hoodie.cleaner.commits.retained", 2). // TODO: this one needs to be parameterized
     option("hoodie.keep.min.commits", 3). // TODO: this one needs to be parameterized
     option("hoodie.keep.max.commits", 5). // TODO: this one needs to be parameterized
     option("hoodie.copyonwrite.insert.split.size", "1000000").
     option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
     option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
     option(RECORDKEY_FIELD_OPT_KEY, jobConfig.uniqueIndexColumns).
     option(TABLE_NAME, jobConfig.targetTable).
     mode(Append).
     save(jobConfig.targetLocation)
   
    
   
   20/05/01 22:23:01 INFO HoodieCopyOnWriteTable: Total insert buckets for 
partition path 20200117/15 => [WorkloadStat {bucketNumber=3, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=4, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=5, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=6, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=7, 
weight=0.19999967170112984}]
   
   20/05/01 22:23:01 INFO HoodieCopyOnWriteTable: Total Buckets :8, buckets 
info => {0=BucketInfo {bucketType=INSERT, 
fileIdPrefix=ef6c9b89-c73e-4bf8-b410-9f1f53051e03}, 1=BucketInfo 
{bucketType=INSERT, fileIdPrefix=73b570a6-40a1-4544-939a-487a32e1da9d}, 
2=BucketInfo {bucketType=INSERT, 
fileIdPrefix=958064d0-8626-4f21-b0d6-021e907c6244}, 3=BucketInfo 
{bucketType=INSERT, fileIdPrefix=a2914ba2-79f1-4d49-8c70-6bfd9b1aa2c9}, 
4=BucketInfo {bucketType=INSERT, 
fileIdPrefix=f9da5788-66af-4596-8b66-15e894ec35f7}, 5=BucketInfo 
{bucketType=INSERT, fileIdPrefix=b27a4860-d3d8-4dea-89aa-6a49a99d483a}, 
6=BucketInfo {bucketType=INSERT, 
fileIdPrefix=5db8e488-7d75-4030-b263-7d9c46068415}, 7=BucketInfo 
{bucketType=INSERT, fileIdPrefix=0f739c75-71ed-462a-999d-f4c68d0331a0}},
   
   Partition to insert buckets => {default=[WorkloadStat {bucketNumber=0, 
weight=1.0}], 20200117/14=[WorkloadStat {bucketNumber=1, weight=1.0}], 
20200117/07=[WorkloadStat {bucketNumber=2, weight=1.0}], 
20200117/15=[WorkloadStat {bucketNumber=3, weight=0.19999967170112984}, 
WorkloadStat {bucketNumber=4, weight=0.19999967170112984}, WorkloadStat 
{bucketNumber=5, weight=0.19999967170112984}, WorkloadStat {bucketNumber=6, 
weight=0.19999967170112984}, WorkloadStat {bucketNumber=7, 
weight=0.19999967170112984}]},
   
   UpdateLocations mapped to buckets =>{}
   
    
   
   [@ ~]$ hdfs dfs -ls -h 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15
   
   Found 6 items
   
   -rw-r--r--   3  Hadoop_cdp         93 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/.hoodie_partition_metadata
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/0f739c75-71ed-462a-999d-f4c68d0331a0-0_7-60-142_20200501222215.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.5 M 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/5db8e488-7d75-4030-b263-7d9c46068415-0_6-60-141_20200501222215.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/a2914ba2-79f1-4d49-8c70-6bfd9b1aa2c9-0_3-60-138_20200501222215.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/b27a4860-d3d8-4dea-89aa-6a49a99d483a-0_5-60-140_20200501222215.parquet
   
   -rw-r--r--   3  Hadoop_cdp     18.6 M 2020-05-01 22:23 
/projects/cdp/data/attunity_poc/ptdb_PAYMENT_TRANSACTION/20200117/15/f9da5788-66af-4596-8b66-15e894ec35f7-0_4-60-139_20200501222215.parquet
   
    
   
    
   
    
   
   Thanks,
   
   Selva 
   
    

