jlloh commented on issue #8651:
URL: https://github.com/apache/hudi/issues/8651#issuecomment-1623499726

   Sorry to hop on the thread, but @danny0405 I'm using a similar setup to the OP 
(Flink + Async Clustering + COW + Insert), except writing to S3 instead of HDFS. 
I'm also getting small files, and I've realised that the number of files written 
basically corresponds to the parallelism of the Flink environment.
   
   I tried tweaking `write.bucket_assign.tasks`, but it doesn't seem to help; I 
also tried tweaking the parquet size configurations, but they don't seem to take 
effect either. Flink still writes out as many files as the parallelism. In my 
case the parallelism is 30, so it writes out 30 files of ~5MB each. 
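   For reference, the options I've been experimenting with look roughly like this (option names are from the Hudi Flink configuration docs; the path and values are just illustrative of my setup, not a verified fix):
   
   ```sql
   CREATE TABLE hudi_sink (
     -- ... columns omitted ...
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://my-bucket/my-table',   -- hypothetical path
     'table.type' = 'COPY_ON_WRITE',
     'write.operation' = 'insert',
     'write.tasks' = '30',                  -- matches env parallelism; each write task produces its own file
     'write.bucket_assign.tasks' = '4',     -- tweaking this did not reduce the file count
     'write.parquet.max.file.size' = '120'  -- parquet size config (MB) that also seemed to have no effect
   );
   ```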
   
   I guess async clustering could solve this later on, but I saw this line 
in the 
[documentation](https://hudi.apache.org/docs/clustering#execute-clustering):
   ```
   NOTE: Clustering can only be scheduled for tables / partitions not receiving 
any concurrent updates. In the future, concurrent updates use-case will be 
supported as well.
   ```
   Does this mean that if a partition is currently being written to (e.g. with 
daily partitioning), the clustering task won't be able to cluster the 
files until the day has passed and the writer stops writing to that 
partition? I.e. clustering would be delayed by one day.
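   For context, the async clustering options I'm referring to are along these lines (option names are from the Hudi Flink configuration docs; the values are illustrative, not a recommendation):
   
   ```sql
   -- additional Hudi table options for async clustering
   'clustering.schedule.enabled' = 'true',               -- schedule clustering plans from the write pipeline
   'clustering.async.enabled' = 'true',                  -- execute the scheduled plans asynchronously
   'clustering.plan.strategy.small.file.limit' = '600',  -- files under this size (MB) are clustering candidates
   'clustering.plan.strategy.target.file.max.bytes' = '1073741824'  -- ~1GB target output files
   ```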
   
   My DAG for reference below:
   
![image](https://github.com/apache/hudi/assets/15816131/2b338868-1980-464a-9fc7-d418e73fa0e6)
   
   My table configuration:
   
![image](https://github.com/apache/hudi/assets/15816131/3c37cf03-5e13-4265-8201-28bc3c66428c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
