RajasekarSribalan commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-671742308
Yes @bvaradar, we do an initial bulk insert and then upserts for subsequent operations. I configured `hoodie.copyonwrite.record.size.estimate` to 128 when taking the initial load via bulk insert. But during subsequent upserts we hit the memory issues described above and the streaming jobs fail.

We are sure the 10 million records come to roughly 10 GB, and we have given each executor sufficient memory (60 GB and 4 cores). We use DStreams, and each micro-batch is 10 million records (about 10 GB). We persist that RDD to disk because we reuse it for the upsert and the subsequent deletes.

What I can see from the Storage tab in the Spark UI is that Hudi persists data internally in memory. I tried setting `hoodie.write.status.storage.level` to a disk-based level to leave more memory for tasks, but Hudi still appears to persist in memory. Any thoughts on this property? Could it be the reason for the memory issue?
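For reference, a minimal sketch of the two settings in question (property names as they appear in the Hudi configuration docs; the values shown here are illustrative, and the storage-level value is a Spark `StorageLevel` name):

```properties
# Estimated average record size (bytes) Hudi uses to size copy-on-write
# file groups before compaction stats exist -- illustrative value
hoodie.copyonwrite.record.size.estimate=128

# Storage level Hudi uses when caching its internal WriteStatus RDD;
# a disk-based level is intended to leave executor memory free for tasks
hoodie.write.status.storage.level=DISK_ONLY
```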
