zyclove commented on issue #10077:
URL: https://github.com/apache/hudi/issues/10077#issuecomment-1811962706
@ad1happy2go
Thank you very much for your reply despite your busy schedule.
After looking at the output results, there are indeed a lot of small files.
How should we solve this situation now? The job can no longer run. The table
creation statement and execution script are as follows.
```
spark-sql> show create table bi_ods_real.smart_datapoint_report_rw_clear_rt;
CREATE TABLE `bi_ods_real`.`smart_datapoint_report_rw_clear_rt` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`id` STRING COMMENT 'id',
`uuid` STRING COMMENT 'log uuid',
`data_id` STRING,
`dev_id` STRING COMMENT 'id',
`gw_id` STRING,
`product_id` STRING,
`uid` STRING COMMENT 'user ID',
`dp_code` STRING,
`dp_id` STRING COMMENT 'dp point',
`gmtModified` STRING,
`dp_name` STRING,
`dp_time` STRING,
`dp_type` STRING,
`dp_value` STRING,
`gmt_modified` BIGINT COMMENT 'ct time',
`dt` STRING COMMENT 'time partition field',
`dp_mode` STRING)
USING hudi
PARTITIONED BY (dt, dp_mode)
COMMENT ''
TBLPROPERTIES (
'hoodie.bucket.index.num.buckets' = '50',
'primaryKey' = 'id',
'last_commit_time_sync' = '20231021185003298',
'hoodie.common.spillable.diskmap.type' = 'ROCKS_DB',
'hoodie.combine.before.upsert' = 'false',
'hoodie.compact.inline' = 'false',
'type' = 'mor',
'preCombineField' = 'gmt_modified',
'hoodie.datasource.write.partitionpath.field' = 'dt,dp_mode')
```
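For reference, since the table sets `hoodie.compact.inline = 'false'` and no async compaction service is mentioned, log files on this MOR table can accumulate without ever being merged into base files, which shows up as many small files. A hedged sketch of re-enabling inline compaction (the property names are standard Hudi configs; the delta-commit threshold below is illustrative, not tuned for this workload):

```sql
-- Illustrative only: merge log files into base files every N delta commits.
-- '5' is a guessed starting point, not a recommendation for this workload.
ALTER TABLE bi_ods_real.smart_datapoint_report_rw_clear_rt SET TBLPROPERTIES (
  'hoodie.compact.inline' = 'true',
  'hoodie.compact.inline.max.delta.commits' = '5'
);
```

Note also that with a bucket index (`hoodie.bucket.index.num.buckets` = 50) each partition holds up to 50 file groups, and partitioning by hourly `dt` plus `dp_mode` multiplies that file count quickly.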
```
insert into bi_ods_real.smart_datapoint_report_rw_clear_rt
select
  md5(concat(
    coalesce(data_id,''), coalesce(dev_id,''), coalesce(gw_id,''),
    coalesce(product_id,''), coalesce(uid,''), coalesce(dp_code,''),
    coalesce(dp_id,''), coalesce(gmtModified,''),
    if(dp_mode in ('ro','rw','wr'), dp_mode, 'un'),
    coalesce(dp_name,''), coalesce(dp_time,''), coalesce(dp_type,''),
    coalesce(dp_value,''), coalesce(ct,'')
  )) as id,
  _hoodie_record_key as uuid,
  data_id, dev_id, gw_id, product_id, uid,
  dp_code, dp_id, gmtModified,
  if(dp_mode in ('ro','rw','wr'), dp_mode, 'un') as dp_mode,
  dp_name, dp_time, dp_type, dp_value,
  ct as gmt_modified,
  case
    when length(ct) = 10 then date_format(from_unixtime(ct), 'yyyyMMddHH')
    when length(ct) = 13 then date_format(from_unixtime(ct/1000), 'yyyyMMddHH')
    else '1970010100'
  end as dt
from hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt',
       'latest_state', '20231114033500000', '20231114040500000')
lateral view dataPointExplode(split(value,'\001')[0]) dps as
  ct, data_id, dev_id, gw_id, product_id, uid, dp_code, dp_id, gmtModified,
  dp_mode, dp_name, dp_time, dp_type, dp_value
where _hoodie_commit_time > 20231114033500000
  and _hoodie_commit_time <= 20231114040500000
```
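To confirm where the small files are coming from, a hedged sketch using Hudi's Spark SQL procedures (available in recent Hudi releases; check your version's documentation for the exact procedure set):

```sql
-- Illustrative: list file groups and their sizes from the file-system view,
-- and recent commits, to see how many small files each write produces.
CALL show_fsview_all(table => 'bi_ods_real.smart_datapoint_report_rw_clear_rt');
CALL show_commits(table => 'bi_ods_real.smart_datapoint_report_rw_clear_rt');
```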

(screenshot of driver memory usage attached)

Please help me figure out how to solve this, thank you very much.