[GitHub] [hudi] KnightChess commented on pull request #7119: [HUDI-5149] fix spark single file sort plan can not work

GitBox Tue, 15 Nov 2022 21:28:56 -0800


KnightChess commented on PR #7119:
URL: https://github.com/apache/hudi/pull/7119#issuecomment-1316382216


   @xushiyan sorry for the late reply, I have updated the impact.
    For a clustering job, I think there are two main functions:
   - optimize data layout by merging files up to the target file size
   - optimize data layout by sorting
   
   If I use upsert, the small file problem is already solved automatically 
during tagging, so the file sizes will be bigger than 
`hoodie.clustering.plan.strategy.small.file.limit`.  
   Now I want to optimize the data layout to make `col stat` more effective 
without changing the default file size. I only want to sort the data by the 
specified fields. But all the files will be filtered out by 
`hoodie.clustering.plan.strategy.small.file.limit`, so the data cannot be 
optimized by sorting unless the conf is set larger than the max file size to 
make sure every file is included. I think this conf should only be used to 
achieve the function `merge small file to bigger`.
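
   The filtering described above can be illustrated with a minimal sketch. This 
is not Hudi's actual planning code: the helper `select_clustering_candidates`, 
the file names, and the sizes are all hypothetical, and the sketch assumes a 
file is picked only when its size is strictly below the limit.

```python
MB = 1024 * 1024

def select_clustering_candidates(file_sizes, small_file_limit):
    """Hypothetical stand-in for the size filter: keep files below the limit."""
    return {name: size for name, size in file_sizes.items()
            if size < small_file_limit}

# Hypothetical sizes after upsert has already grown the small files.
file_sizes = {"f1.parquet": 110 * MB, "f2.parquet": 118 * MB, "f3.parquet": 120 * MB}

# Illustrative limit below every file size: no file qualifies, so nothing gets sorted.
none_selected = select_clustering_candidates(file_sizes, small_file_limit=100 * MB)

# Workaround from the comment: set the limit above the largest file size.
workaround_limit = max(file_sizes.values()) + 1
all_selected = select_clustering_candidates(file_sizes, small_file_limit=workaround_limit)

print(sorted(none_selected))
print(sorted(all_selected))
```

   With the first limit no file is selected; only the inflated limit brings 
every file into the plan.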
   
   Consider the strategy `SparkSingleFileSortExecutionStrategy`: it only 
supports the `optimize data layout by sort` function, so I think it should not 
be limited by `hoodie.clustering.plan.strategy.small.file.limit`. 
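
   The behavior I am proposing could look roughly like the sketch below. It is 
only an illustration under assumptions, not the change in this PR: 
`plan_candidates` and the `sort_only` flag are hypothetical names, and the size 
comparison is assumed to be strict.

```python
MB = 1024 * 1024

def plan_candidates(file_sizes, small_file_limit, sort_only):
    """Hypothetical planner: a sort-only strategy ignores the small file limit."""
    if sort_only:
        # Sorting rewrites each file in place, so every file is a candidate.
        return dict(file_sizes)
    # Merge-oriented strategies still pick only files below the limit.
    return {name: size for name, size in file_sizes.items()
            if size < small_file_limit}

# Hypothetical file sizes, all above the illustrative limit.
file_sizes = {"f1.parquet": 110 * MB, "f2.parquet": 120 * MB}
limit = 100 * MB

merge_plan = plan_candidates(file_sizes, limit, sort_only=False)
sort_plan = plan_candidates(file_sizes, limit, sort_only=True)

print(sorted(merge_plan))
print(sorted(sort_plan))
```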
   


