cdmikechen opened a new issue #1481: [SUPPORT] If one Spark session deals with 
many tables, the Hudi cache may report `java.lang.OutOfMemoryError`
URL: https://github.com/apache/incubator-hudi/issues/1481
 
 
   **Describe the problem you faced**
   If one Spark session is used to process many tables, the Hudi cache may 
report `java.lang.OutOfMemoryError`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   In my Hudi production environment, Hudi processes 4000+ tables every day, 
always using the COW (copy-on-write) table type. 
   For each table, we use JDBC to read the data from the original table 
(RDBMS) and compare it with the existing Hudi data to identify the records 
that need to be merged. Finally, we use Hudi's upsert method to update the 
existing data.
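
   The comparison step can be illustrated with a minimal sketch. It is plain Python rather than our actual Spark job, and the function name `rows_to_upsert`, the key field `id`, and the sample rows are all hypothetical. It assumes both sides are already loaded as lists of dicts, with any Hudi metadata columns dropped from the existing rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_to_upsert(source_rows: List[Row], existing_rows: List[Row], key_field: str) -> List[Row]:
    """Return the source rows that are new or differ from the existing row with the same key."""
    # Index the existing data by record key for constant-time lookups.
    existing_by_key = {row[key_field]: row for row in existing_rows}
    changed: List[Row] = []
    for row in source_rows:
        current = existing_by_key.get(row[key_field])
        # A row needs an upsert if its key is unknown or any field differs.
        if current is None or current != row:
            changed.append(row)
    return changed


# Hypothetical sample data: row 1 is unchanged, row 2 is updated, row 3 is new.
source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]
existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
to_merge = rows_to_upsert(source, existing, "id")
```

   Only the rows in `to_merge` would then be passed to the upsert, so unchanged records are not rewritten.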
   Because the scheduled processing time differs from table to table, we keep 
the Spark session open at all times so that a processing task can be executed 
immediately. In addition, to respond to some REST-based requests, we use 
Spring Boot to start the Spark session.
   The program runs normally for the first several days, but then, on some 
day, `java.lang.OutOfMemoryError` errors start to appear frequently. I checked 
the logs and found messages suggesting that some of Hudi's caches might have 
caused the `java.lang.OutOfMemoryError`. 
   So I was wondering whether I should start a timeline service independently, 
or clean up Hudi's cached data for the tables that have already been processed?
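
   To make the second option concrete, here is a minimal sketch of what per-table cleanup could look like. It is not Hudi's API: the class `BoundedTableCache`, its `release` method, and the loader function are hypothetical, and it only shows the idea of keeping cached state for at most a fixed number of tables and dropping a table's entry once that table has been processed.

```python
from collections import OrderedDict
from typing import Any, Callable


class BoundedTableCache:
    """Hypothetical per-table cache that holds state for at most `max_tables` tables."""

    def __init__(self, max_tables: int, loader: Callable[[str], Any]) -> None:
        if max_tables < 1:
            raise ValueError("max_tables must be at least 1")
        self._max_tables = max_tables
        self._loader = loader
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, table_name: str) -> Any:
        """Return the cached state for a table, loading it on a miss."""
        if table_name in self._entries:
            # Mark the table as most recently used.
            self._entries.move_to_end(table_name)
            return self._entries[table_name]
        value = self._loader(table_name)
        self._entries[table_name] = value
        if len(self._entries) > self._max_tables:
            # Evict the least recently used table to bound memory.
            self._entries.popitem(last=False)
        return value

    def release(self, table_name: str) -> None:
        """Drop a table's cached state explicitly, e.g. after its job finishes."""
        self._entries.pop(table_name, None)

    def __contains__(self, table_name: str) -> bool:
        return table_name in self._entries

    def __len__(self) -> int:
        return len(self._entries)


# Hypothetical usage: the loader stands in for building per-table cached state.
cache = BoundedTableCache(max_tables=2, loader=lambda name: {"table": name})
for name in ["t1", "t2", "t3"]:
    cache.get(name)
cache.release("t2")
```

   With something like this, memory held by the long-running session would depend on `max_tables` rather than on the total number of tables processed.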
   
   
   **Expected behavior**
   
   I am not sure. Maybe the Hudi table cache could be held by a standalone 
server or a timeline server?
   
   **Environment Description**
   
   * Hudi version :
   0.5.1
   * Spark version :
   2.4.3
   * Hive version :
   2.3.3
   * Hadoop version :
   2.8.5
   * Storage (HDFS/S3/GCS..) :
   HDFS
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   no
   
   **Stacktrace**
   
   
   
   
