KnightChess commented on PR #7119:
URL: https://github.com/apache/hudi/pull/7119#issuecomment-1316382216
@xushiyan sorry for the late reply, I have updated the impact.
A clustering job, I think, has two main functions:
- optimize data layout by merging small files up to the target file size
- optimize data layout by sorting
If I use upsert, the small-file problem is solved automatically during
tagging, so the files will already be larger than
`hoodie.clustering.plan.strategy.small.file.limit`.
Now I want to optimize the data layout to make column stats (`col stat`) more
efficient without changing the default file size; I only want to sort the
data by a specified field. But all of the files are filtered out by
`hoodie.clustering.plan.strategy.small.file.limit`, so the data cannot be
optimized by sorting unless I set that conf larger than the max file size to
make sure every file is included. I think this conf should only serve the
`merge small files into bigger ones` function.
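
To make the filtering problem concrete, here is a minimal sketch in Python. It is a simplified model of size-based candidate selection, not Hudi's actual implementation; the function name, file names, and sizes are hypothetical and chosen only for illustration.

```python
MB = 1024 * 1024

def eligible_for_clustering(file_sizes, small_file_limit):
    """Simplified model: keep only files strictly smaller than the limit.

    `small_file_limit` stands in for
    hoodie.clustering.plan.strategy.small.file.limit.
    """
    return {name: size for name, size in file_sizes.items()
            if size < small_file_limit}

# Hypothetical base files that upsert small-file handling has already grown.
files = {"f1.parquet": 110 * MB, "f2.parquet": 118 * MB, "f3.parquet": 120 * MB}

# Assumed limit below every file size: nothing is selected, so nothing is sorted.
with_low_limit = eligible_for_clustering(files, 100 * MB)

# Workaround from the comment: raise the limit above the largest file size.
with_raised_limit = eligible_for_clustering(files, 121 * MB)

print(sorted(with_low_limit))
print(sorted(with_raised_limit))
```

With the lower limit no file is a candidate, so a sort-only clustering run has nothing to rewrite; raising the limit above the largest file makes every file a candidate again.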
Consider the strategy `SparkSingleFileSortExecutionStrategy`: it only
supports the `optimize data layout by sort` function, so I think it should
not be limited by `hoodie.clustering.plan.strategy.small.file.limit`.
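
The proposed behavior can be sketched the same way. This is only an illustration of the idea in Python, not the change made in the PR; `select_candidates` and its `sort_only` flag are hypothetical names, where `sort_only` stands for "the configured strategy only sorts files and does not merge them".

```python
MB = 1024 * 1024

def select_candidates(file_sizes, small_file_limit, sort_only):
    """Pick files to cluster (illustrative model only).

    sort_only=True models a sort-only strategy: the small-file limit is
    ignored, because the goal is re-sorting rather than merging.
    """
    if sort_only:
        return dict(file_sizes)
    return {name: size for name, size in file_sizes.items()
            if size < small_file_limit}

# Hypothetical file sizes; the limit value is an assumption for the example.
files = {"f1.parquet": 40 * MB, "f2.parquet": 118 * MB, "f3.parquet": 120 * MB}
limit = 100 * MB

size_based = select_candidates(files, limit, sort_only=False)
sort_based = select_candidates(files, limit, sort_only=True)

print(sorted(size_based))
print(sorted(sort_based))
```

In this model the size-based path still picks only the small file, while the sort-only path keeps every file regardless of size, which is the separation of concerns argued for above.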
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]