[GitHub] [hudi] Zouxxyy commented on pull request #7372: [HUDI-5326] Fix clustering group building in SparkSizeBasedClusteringPlanStrategy

GitBox Wed, 07 Dec 2022 09:05:23 -0800


Zouxxyy commented on PR #7372:
URL: https://github.com/apache/hudi/pull/7372#issuecomment-1341286652


   The failing UT is the test `Test Call run_clustering Procedure with specific 
instants` in `TestClusteringProcedure`.
   
   Before my patch the test passed, but the `orderAllFiles` result computed 
for the linear clustering in this example was 201, not 20. There are two 
reasons:
   
   1. `repartitionRecords` in `RowCustomColumnsSortPartitioner` uses `coalesce`. 
Since the number of input partitions is 1, the number of output partitions is 
still 1 (although the number of clustering groups is 20).
   2. `max.bytes.per.group` is used to limit the size of the written parquet 
files, so it splits the single partition from the previous step. Since the 
data is larger in memory than on disk (perhaps because of compression), 201 
output files are produced.
   
   **It can be found that the parallelism is only 1.**
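   
   To illustrate reason 1, here is a minimal sketch in plain Java, not Hudi or 
Spark code. It is a toy model of the two operations under the assumption that 
`coalesce` (without shuffle) only merges existing partitions, while 
`repartition` redistributes every record. The class and method names 
(`PartitionCountSketch`, `emptyPartitions`) are hypothetical and chosen only 
for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; it does not use or reproduce the Spark API.
class PartitionCountSketch {

    // Model of coalesce without shuffle: existing partitions are merged,
    // so the output can never have more partitions than the input.
    static <T> List<List<T>> coalesce(List<List<T>> partitions, int target) {
        int n = Math.max(1, Math.min(partitions.size(), target));
        List<List<T>> out = emptyPartitions(n);
        for (int i = 0; i < partitions.size(); i++) {
            out.get(i % n).addAll(partitions.get(i));
        }
        return out;
    }

    // Model of repartition: every record is redistributed (a shuffle),
    // so the output has exactly `target` partitions (target must be > 0).
    static <T> List<List<T>> repartition(List<List<T>> partitions, int target) {
        List<List<T>> out = emptyPartitions(target);
        int next = 0;
        for (List<T> partition : partitions) {
            for (T record : partition) {
                out.get(next++ % target).add(record);
            }
        }
        return out;
    }

    private static <T> List<List<T>> emptyPartitions(int n) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        return out;
    }

    public static void main(String[] args) {
        // One input partition holding 100 records, 20 requested groups.
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            records.add(i);
        }
        List<List<Integer>> input = new ArrayList<>();
        input.add(records);

        System.out.println(coalesce(input, 20).size());    // prints 1
        System.out.println(repartition(input, 20).size()); // prints 20
    }
}
```

   In this model a single input partition stays a single partition under 
`coalesce` no matter how many groups are requested, which matches the 
parallelism of 1 observed above.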
   
   With my patch, step 2 no longer does the splitting; it generates just one 
file, which is why the test fails.
   
   So we need to fix reason 1. This is the ticket: 
https://issues.apache.org/jira/browse/HUDI-5328
   
   My initial idea is to use `repartition` instead of `coalesce`, or 
`repartitionByRange` (but it will not sort).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
