RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671139855


   Thanks @bvaradar for the quick response.
   
   We are initially loading a table with 2 TB of data, and each column holds 
large values (HTML content), though we are not sure of the exact size of each 
value. During the initial snapshot we don't set limitFileSize, so we leave 
Hudi to use the default 120 MB file size.
   
   hoodie.copyonwrite.record.size.estimate - I haven't used this. I'll try it 
and let you know the outcome.
   
   I get "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB 
of 30 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead" during the last phase of Hudi, i.e. 
during the write. I hope this parameter will solve the issue.
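   For reference, this is roughly how we would raise the overhead the error 
message suggests. The flag names are the standard Spark-on-YARN ones, but the 
values below are illustrative assumptions, not our actual settings:

   ```shell
   # Illustrative spark-submit flags; tune values to your cluster.
   # spark.yarn.executor.memoryOverhead is in MiB.
   spark-submit \
     --conf spark.executor.memory=26g \
     --conf spark.yarn.executor.memoryOverhead=4096 \
     my_hudi_ingest_job.py
   ```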
   
   Regarding the bulk insert parallelism, we get the number of partitions of 
the existing table and set it as the bulk insert parallelism.
   
   In our case, the 2 TB of data spans close to 17000 partitions, so the bulk 
insert parallelism will be set to 17000.
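   To make the setup concrete, here is a minimal sketch of the write options 
we are discussing. The option keys are real Hudi configs, but the table name, 
record-size value, and the way parallelism is derived are assumptions for 
illustration only:

   ```python
   # Sketch of the Hudi bulk-insert options discussed above (values illustrative).
   num_partitions = 17000  # partition count of the existing 2 TB table

   hudi_options = {
       "hoodie.table.name": "my_table",  # hypothetical table name
       "hoodie.datasource.write.operation": "bulk_insert",
       # parallelism derived from the source table's partition count
       "hoodie.bulkinsert.shuffle.parallelism": str(num_partitions),
       # per-record size estimate in bytes; would need tuning for large HTML columns
       "hoodie.copyonwrite.record.size.estimate": "1024",
       # hoodie.parquet.max.file.size left unset -> Hudi's 120 MB default applies
   }

   # With PySpark this would be applied roughly as:
   # df.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
   ```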
   
   Please correct me/make suggestions if you have further points to add.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
