huliwuli commented on issue #10716: URL: https://github.com/apache/hudi/issues/10716#issuecomment-1981519987
> @huliwuli So, it looks like your per-record size is really small. Hudi uses the previous commit's statistics to guess future record sizes. For the very first commit, it relies on the config `hoodie.copyonwrite.record.size.estimate` (default 1024). So setting it to a lower value might have worked for you. Is that correct?
>
> `bulk_insert` doesn't merge small files out of the box, so you need to run a clustering job to merge them. If most of the time you just get inserts, then you may just use a COW table. I assume by "delete previous data" you mean deleting old partitions only.

Thanks for the reply. `hoodie.copyonwrite.record.size.estimate` works on my MOR table when I set it to 30-40. In most cases, we delete some rows from one old partition, but the number of rows is not predictable. We currently use MOR; if you suggest we use a COW table, can I switch to COW directly from the Hudi options?
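For readers following along, the config discussed above might look like the following in a Spark writer. This is only an illustrative sketch: the table name, base path, and the estimate value of `35` are placeholders, not part of the original discussion.

```python
# Hypothetical Hudi write options (names/values are illustrative placeholders,
# except the config keys, which are real Hudi configs).
hudi_options = {
    "hoodie.table.name": "my_table",                        # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # The very first commit has no prior commit statistics to estimate record
    # sizes from, so Hudi falls back to this config (default 1024 bytes).
    # A value around 30-40 matches the tiny records described in this thread.
    "hoodie.copyonwrite.record.size.estimate": "35",
}

# In a real job, these options would be passed to the DataFrame writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Because the estimate only matters for the first commit (later commits use observed statistics), tuning it mainly affects initial file sizing.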
