RajasekarSribalan commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-671742308
Yes @bvaradar, we do an initial bulk insert and then upserts for subsequent operations. I configured `hoodie.copyonwrite.record.size.estimate` to 128 when taking the initial load via bulk insert. But during subsequent upserts we hit the memory issues described above and the streaming jobs fail.

We are sure the 10 million records come to roughly 10 GB, and we have given each executor sufficient memory (60 GB and 4 cores). We use DStreams, and each micro-batch is 10 million records (about 10 GB). We persist that RDD to disk because we reuse it for the upsert and the subsequent deletes.

What I can see from the Storage tab in the Spark UI is that Hudi persists data internally in memory. I tried setting `hoodie.write.status.storage.level` to a disk-based level to leave more memory for tasks, but Hudi still appears to persist in memory. Any thoughts on this property? Could it be the reason for the memory issue?
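For reference, a minimal sketch of the two settings in question (property names as they appear in the Hudi configuration docs; the values shown here are illustrative, and the storage-level value is a Spark `StorageLevel` name):

```properties
# Estimated average record size (bytes) Hudi uses to size copy-on-write
# file groups before compaction stats exist -- illustrative value
hoodie.copyonwrite.record.size.estimate=128

# Storage level Hudi uses when caching its internal WriteStatus RDD;
# a disk-based level is intended to leave executor memory free for tasks
hoodie.write.status.storage.level=DISK_ONLY
```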
