[GitHub] [hudi] codope commented on pull request #3142: [HUDI-1483] Support async clustering for deltastreamer and Spark streaming

GitBox Tue, 29 Jun 2021 06:19:40 -0700


codope commented on pull request #3142:
URL: https://github.com/apache/hudi/pull/3142#issuecomment-869829116



   @nsivabalan Thanks for reviewing the PR. I agree with your source code 
comments; there is scope for reuse. I will address them and update the 
PR. My responses to the high-level questions are below.
    
   > * Now we have both clustering and compaction, I see that you have added 
clustering related code just after compaction where ever applicable. Is the 
higher priority for compaction intentional? or should we have clustering 
followed by compaction? or does it not matter at all.
   
   When both clustering and compaction are enabled, compaction runs just 
before clustering. Compaction and clustering currently cannot run at the same 
time on the same file groups, and clustering can take significant time, so 
the compaction thread is started first. If clustering is scheduled for file 
groups that are under compaction, clustering of those file groups is skipped 
and picked up in a subsequent run, after compaction completes.
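
To illustrate the skip-and-retry behaviour described above, here is a minimal toy model in plain Python. It is not Hudi code: `FileGroup`, `plan_clustering`, and the partition and file IDs are hypothetical names invented for this sketch, and the real scheduling logic in the PR may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileGroup:
    """Hypothetical stand-in for a file group identifier."""
    partition: str
    file_id: str


def plan_clustering(candidates, pending_compaction):
    """Split candidates into (schedulable now, deferred to a later run).

    File groups with a pending compaction are deferred, because compaction
    and clustering cannot run on the same file group at the same time.
    """
    schedulable = [fg for fg in candidates if fg not in pending_compaction]
    deferred = [fg for fg in candidates if fg in pending_compaction]
    return schedulable, deferred


groups = [FileGroup("2021/06/29", f"fg-{i}") for i in range(4)]
pending = {groups[1], groups[3]}  # assumed to be under compaction

now, later = plan_clustering(groups, pending)
print("cluster now:  ", [fg.file_id for fg in now])
print("deferred:     ", [fg.file_id for fg in later])

# Once compaction completes, the deferred groups become schedulable.
retry_now, retry_later = plan_clustering(later, set())
print("next run:     ", [fg.file_id for fg in retry_now])
```

The point of the sketch is only that nothing is lost: deferred file groups are eligible again in the next scheduling round.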
   
   > * I came across a class named SchedulerConfGenerator. Don't we need to 
make any changes here for async clustering?
   
   We will need to make changes here if we create a separate job pool for 
clustering and assign weights to the different jobs. Unlike compaction, I did 
not feel the need for a separate job pool for clustering. By default, each pool 
gets an equal share of resources, but within each pool jobs run in FIFO order. 
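
For reference, if a separate clustering pool were wanted later, a Spark fair scheduler allocation file could look roughly like the output of the sketch below. This is an illustration only, not what `SchedulerConfGenerator` emits: the pool names, weights, and the `build_allocations` helper are assumptions. The XML elements (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) are the ones Spark's fair scheduler reads from the file named by `spark.scheduler.allocation.file` when `spark.scheduler.mode` is `FAIR`.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Build a Spark fair scheduler allocation XML string.

    `pools` maps a pool name to (schedulingMode, weight, minShare).
    """
    root = ET.Element("allocations")
    for name, (mode, weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pool names and weights, chosen only for illustration.
pools = {
    "ingest": ("FAIR", 2, 0),
    "compaction": ("FAIR", 1, 0),
    "clustering": ("FAIR", 1, 0),
}

xml_text = build_allocations(pools)
print(xml_text)
```

A job would then be routed to a pool by setting the `spark.scheduler.pool` local property on the thread that submits it. Pools not listed in the file fall back to Spark's defaults (FIFO, weight 1, minShare 0).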


