zyclove commented on issue #10077:
URL: https://github.com/apache/hudi/issues/10077#issuecomment-1811962706

   @ad1happy2go 
   
   Thank you very much for your reply despite your busy schedule.
   After looking at the output results, there are indeed a lot of small files. 
How should we solve this situation now? The job can no longer run. The table 
creation statements and execution scripts are as follows.
   
   ```
   spark-sql> show create table bi_ods_real.smart_datapoint_report_rw_clear_rt;
   CREATE TABLE `bi_ods_real`.`smart_datapoint_report_rw_clear_rt` (
     `_hoodie_commit_time` STRING,
     `_hoodie_commit_seqno` STRING,
     `_hoodie_record_key` STRING,
     `_hoodie_partition_path` STRING,
     `_hoodie_file_name` STRING,
     `id` STRING COMMENT 'id',
     `uuid` STRING COMMENT 'log uuid',
     `data_id` STRING,
     `dev_id` STRING COMMENT 'id',
     `gw_id` STRING,
     `product_id` STRING,
     `uid` STRING COMMENT '用户ID',
     `dp_code` STRING,
     `dp_id` STRING COMMENT 'dp点',
     `gmtModified` STRING,
     `dp_name` STRING,
     `dp_time` STRING,
     `dp_type` STRING,
     `dp_value` STRING,
     `gmt_modified` BIGINT COMMENT 'ct 时间',
     `dt` STRING COMMENT '时间分区字段',
     `dp_mode` STRING)
   USING hudi
   PARTITIONED BY (dt, dp_mode)
   COMMENT ''
   TBLPROPERTIES (
     'hoodie.bucket.index.num.buckets' = '50',
     'primaryKey' = 'id',
     'last_commit_time_sync' = '20231021185003298',
     'hoodie.common.spillable.diskmap.type' = 'ROCKS_DB',
     'hoodie.combine.before.upsert' = 'false',
     'hoodie.compact.inline' = 'false',
     'type' = 'mor',
     'preCombineField' = 'gmt_modified',
     'hoodie.datasource.write.partitionpath.field' = 'dt,dp_mode')
   ``` 
   
   ```
   insert into bi_ods_real.smart_datapoint_report_rw_clear_rt 
   select
         
md5(concat(coalesce(data_id,''),coalesce(dev_id,''),coalesce(gw_id,''),coalesce(product_id,''),coalesce(uid,''),coalesce(dp_code,''),coalesce(dp_id,''),coalesce(gmtModified,''),if(dp_mode
 in 
('ro','rw','wr'),dp_mode,'un'),coalesce(dp_name,''),coalesce(dp_time,''),coalesce(dp_type,''),coalesce(dp_value,''),coalesce(ct,'')))
 as id, 
         _hoodie_record_key as uuid,
         data_id,dev_id,gw_id,product_id,uid,
         dp_code,dp_id,gmtModified,if(dp_mode in ('ro','rw','wr'),dp_mode,'un') 
as dp_mode ,dp_name,dp_time,dp_type,dp_value,
         ct as gmt_modified,
         case 
             when length(ct)=10 then 
date_format(from_unixtime(ct),'yyyyMMddHH')  
             when length(ct)=13 then 
date_format(from_unixtime(ct/1000),'yyyyMMddHH') 
             else '1970010100' end as dt
   from 
       
hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt', 
'latest_state', '20231114033500000', '20231114040500000')  
       lateral  view dataPointExplode(split(value,'\001')[0]) dps as ct, 
data_id, dev_id, gw_id, product_id, uid, dp_code, dp_id, gmtModified, dp_mode, 
dp_name, dp_time, dp_type, dp_value
   where _hoodie_commit_time >20231114033500000 and 
_hoodie_commit_time<=20231114040500000
   ``` 
   
   
![image](https://github.com/apache/hudi/assets/15028279/b56e274f-f6fb-46a7-85af-67eaa5894ce1)
   
   The driver memory usage is shown below:
   
![image](https://github.com/apache/hudi/assets/15028279/f135f84c-a906-4391-87d4-eaf7585998e6)
   
   
   Please help me find out how to solve it, thank you very much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to