nsivabalan edited a comment on issue #3892:
URL: https://github.com/apache/hudi/issues/3892#issuecomment-956756527


   Let me try to explain. @bhasudha: can you document this somewhere? It might be
   useful for everyone in the community.
   
   Bulk_insert: 
   This does not do any small file handling, so it relies solely on
   https://hudi.apache.org/docs/configurations/#hoodieparquetmaxfilesize and the
   parallelism set for bulk_insert.
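
   For example, a minimal bulk_insert write might look like this (a sketch only:
   `df`, the table name, the path, and the values are placeholders, and record
   key / precombine configs are omitted):

   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Sketch of a bulk_insert write; names and values are illustrative.
   def bulkInsertSketch(df: DataFrame): Unit = {
     df.write.format("hudi").
       option("hoodie.datasource.write.operation", "bulk_insert").
       // target max size per data file, in bytes (the config linked above)
       option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
       // with no small file handling, the file count follows this parallelism
       option("hoodie.bulkinsert.shuffle.parallelism", "100").
       option("hoodie.table.name", "my_table").
       mode(SaveMode.Append).
       save("/tmp/hudi/my_table")
   }
   ```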
   
   Insert: 
   This will do small file handling and may bin-pack incoming records into
   existing files.
   For the first commit to a hudi table, hudi has no idea of the record size, so
   it relies on HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key()
   (`hoodie.copyonwrite.record.size.estimate`) to determine how many records
   might go into one data file.
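
   For instance, if you know your records average roughly 512 bytes, you could
   hint that to the first commit (a sketch; the value and names are illustrative):

   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Sketch of hinting the record size on the very first insert commit.
   // 512 is an illustrative value; pick something close to your actual records.
   def firstInsertSketch(df: DataFrame): Unit = {
     df.write.format("hudi").
       option("hoodie.datasource.write.operation", "insert").
       // HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE
       option("hoodie.copyonwrite.record.size.estimate", "512").
       option("hoodie.table.name", "my_table").
       mode(SaveMode.Append).
       save("/tmp/hudi/my_table")
   }
   ```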
   
   In subsequent commits, hudi will infer the record size from previous commits
   and use that for small file handling, but it still tries to honor the max file
   size from
   https://hudi.apache.org/docs/configurations/#hoodieparquetmaxfilesize
   
   The number of records per spark partition is driven by
   `hoodie.copyonwrite.insert.auto.split`. Its default value is true, in which
   case hudi goes with parquetMaxFileSize / avgRecordSize. If this config is set
   to false, hudi relies on `hoodie.copyonwrite.insert.split.size` to determine
   how many records to assign to one spark partition.
   
   By the way, each operation has its own parallelism config, in case you weren't
   aware of them (see the sketch after this list):
   
   hoodie.upsert.shuffle.parallelism
   hoodie.insert.shuffle.parallelism
   hoodie.delete.shuffle.parallelism
   hoodie.bulkinsert.shuffle.parallelism
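
   For example, all four can be set on the same write and hudi picks the one
   matching the operation (a sketch; 200 is an illustrative value):

   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Each value applies only to its matching write operation.
   def upsertWithParallelismSketch(df: DataFrame): Unit = {
     df.write.format("hudi").
       option("hoodie.datasource.write.operation", "upsert").
       option("hoodie.upsert.shuffle.parallelism", "200").
       option("hoodie.insert.shuffle.parallelism", "200").
       option("hoodie.delete.shuffle.parallelism", "200").
       option("hoodie.bulkinsert.shuffle.parallelism", "200").
       option("hoodie.table.name", "my_table").
       mode(SaveMode.Append).
       save("/tmp/hudi/my_table")
   }
   ```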
   

