Zouxxyy commented on PR #7372: URL: https://github.com/apache/hudi/pull/7372#issuecomment-1341286652
The failing UT is `Test Call run_clustering Procedure with specific instants` in `TestClusteringProcedure`.

Before my patch the test passed, but the `orderAllFiles` result for linear clustering in this example was 201, not 20. There are two reasons:

1. `repartitionRecords` in `RowCustomColumnsSortPartitioner` uses `coalesce`. Since the number of input partitions is 1, the number of output partitions is still 1 (even though the number of cluster groups is 20).
2. `max.bytes.per.group` is used to limit the size of the written parquet files, which splits the partition from the previous step. Since the data is larger in memory than on disk (maybe because of compression), the result is 201 output files.

**It can be found that the parallelism is only 1.**

After my patch, step 2 no longer does the splitting; it just generates one file, which is why the test fails.

So we need to fix reason 1. This is the ticket: https://issues.apache.org/jira/browse/HUDI-5328

My initial idea is to use `repartition` instead of `coalesce`, or `repartitionByRange` (but it will not sort).
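To illustrate reason 1, here is a minimal toy model in plain Python. It is not Hudi or Spark code: the two functions only mimic the partition-count behavior of Spark's `coalesce` (without shuffle) and `repartition`, and the 100 records are made-up illustration data. Only the target of 20 comes from the example above.

```python
def coalesce(partitions, n):
    """Toy model of coalesce without shuffle: it can only merge existing
    partitions, so the result never has more partitions than the input."""
    n = min(n, len(partitions))
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model of repartition: every record is redistributed (a full
    shuffle), so the result always has exactly n partitions."""
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


# Hypothetical input: a single partition holding 100 records.
single_partition = [list(range(100))]

print(len(coalesce(single_partition, 20)))     # 1
print(len(repartition(single_partition, 20)))  # 20
```

With one input partition, asking `coalesce` for 20 still yields 1, which would match the parallelism of 1 observed in the test, while `repartition` yields 20 at the cost of a shuffle.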
