jlloh commented on issue #8651:
URL: https://github.com/apache/hudi/issues/8651#issuecomment-1623499726

   Sorry to hop on the thread, but @danny0405 I'm using a similar setup to the OP 
(Flink + Async Clustering + COW + Insert), except writing to S3 instead of HDFS. 
I'm also getting small files, and I've realised that the number of files written 
basically corresponds to the parallelism of the Flink environment.
   
   I tried tweaking `write.bucket_assign.tasks`, but it doesn't seem to help; I 
also tried tweaking the parquet size configurations, but they don't seem to take 
effect either. Flink still writes out as many files as the parallelism. In my 
case the parallelism is 30, so it writes out 30 files of ~5MB each. 
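   For reference, the options I've been experimenting with look roughly like this (option names are from the Hudi Flink configuration docs; the path and values are just illustrative of my setup, not a verified fix):
   
   ```sql
   CREATE TABLE hudi_sink (
     -- ... columns omitted ...
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://my-bucket/my-table',   -- hypothetical path
     'table.type' = 'COPY_ON_WRITE',
     'write.operation' = 'insert',
     'write.tasks' = '30',                  -- matches env parallelism; each write task produces its own file
     'write.bucket_assign.tasks' = '4',     -- tweaking this did not reduce the file count
     'write.parquet.max.file.size' = '120'  -- parquet size config (MB) that also seemed to have no effect
   );
   ```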
   
   I guess async clustering could solve this later on, but I saw this line 
in the 
[documentation](https://hudi.apache.org/docs/clustering#execute-clustering):
   ```
   NOTE: Clustering can only be scheduled for tables / partitions not receiving 
any concurrent updates. In the future, concurrent updates use-case will be 
supported as well.
   ```
   Does this mean that if a partition is currently being written to (e.g. with 
daily partitioning), the clustering task won't be able to cluster the 
files until the day has passed and the writer stops writing to that 
partition? I.e. clustering would be delayed by one day.
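   For context, the async clustering options I'm referring to are along these lines (option names are from the Hudi Flink configuration docs; the values are illustrative, not a recommendation):
   
   ```sql
   -- additional Hudi table options for async clustering
   'clustering.schedule.enabled' = 'true',               -- schedule clustering plans from the write pipeline
   'clustering.async.enabled' = 'true',                  -- execute the scheduled plans asynchronously
   'clustering.plan.strategy.small.file.limit' = '600',  -- files under this size (MB) are clustering candidates
   'clustering.plan.strategy.target.file.max.bytes' = '1073741824'  -- ~1GB target output files
   ```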
   
   My DAG for reference below:
   
![image](https://github.com/apache/hudi/assets/15816131/2b338868-1980-464a-9fc7-d418e73fa0e6)
   
   My table configuration:
   
![image](https://github.com/apache/hudi/assets/15816131/3c37cf03-5e13-4265-8201-28bc3c66428c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
