2018yinjian commented on issue #11794:
URL: https://github.com/apache/hudi/issues/11794#issuecomment-2296390048

   Versions: old cluster = 0.12.0, new cluster = 0.14.0
   Steps:
   1. On the new cluster, create the table with spark-sql:
   CREATE TABLE ods.ods_mct_track_result (
     data_create_time TIMESTAMP COMMENT 'data creation time',
     data_update_time TIMESTAMP COMMENT 'time written to the data warehouse',
     data_delete_flag INT COMMENT 'whether the source data is deleted (1 no, 2 yes)',
     kafka_offset BIGINT COMMENT 'offset',
     kafka_send_time TIMESTAMP COMMENT 'Kafka send time',
     canal_create_time TIMESTAMP COMMENT 'Canal creation time',
     result_id STRING COMMENT 'attribution result ID',
     third_result_id STRING COMMENT 'third-party result ID',
     result_type INT COMMENT 'attribution result type: 10 account, 20 plan, 30 unit, 40 store/site promotion data, 41 store/site traffic data, 50 landing page, 60 keyword',
     result_date TIMESTAMP COMMENT 'attribution result date, format yyyyMMdd',
     keyword STRING COMMENT 'promotion keyword',
     page_id INT COMMENT 'landing page ID',
     site_id INT COMMENT 'promoted store/site ID',
     cell_id INT COMMENT 'promotion unit ID',
     plan_id INT COMMENT 'promotion plan ID',
     account_id INT COMMENT 'promotion account ID',
     channel_one_id INT COMMENT 'level-1 channel ID',
     channel_one_name STRING COMMENT 'level-1 channel name',
     channel_two_id INT COMMENT 'level-2 channel ID',
     channel_two_name STRING COMMENT 'level-2 channel name',
     channel_thr_id INT COMMENT 'level-3 channel ID',
     channel_thr_name STRING COMMENT 'level-3 channel name',
     province_id INT COMMENT 'province ID',
     province_name STRING COMMENT 'province name',
     city_id INT COMMENT 'city ID',
     city_name STRING COMMENT 'city name',
     search_word STRING COMMENT 'search term',
     creative_id INT COMMENT 'creative ID',
     consume_amount INT COMMENT 'spend amount, unit: cents',
     consume_top_amount INT COMMENT 'top-position spend amount',
     consume_top_sort INT COMMENT 'top-position average rank',
     show_count INT COMMENT 'impression count',
     show_top_count INT COMMENT 'top-position impression count',
     show_top_win DECIMAL(10,4) COMMENT 'top-position impression win rate',
     click_count INT COMMENT 'click count',
     click_top_count INT COMMENT 'top-position click count',
     clue_count INT COMMENT 'media lead count',
     ask_click_count INT COMMENT 'consultation button click count',
     ask_clue_count INT COMMENT 'consultation media lead count',
     phone_click_count INT COMMENT 'phone button click count',
     phone_succeed_count INT COMMENT 'phone button connected-call count',
     phone_clue_count INT COMMENT 'phone media lead count',
     form_click_count INT COMMENT 'form button click count',
     form_succeed_count INT COMMENT 'successful form submission count',
     form_clue_count INT COMMENT 'form media lead count',
     other_click_count INT COMMENT 'other button click count',
     unique_index STRING COMMENT 'unique index column in plaintext, composed of account_id, result_type, plan_id, cell_id, site_id, page_id, third_result_id, result_date',
     unique_index_encrypt STRING COMMENT 'unique index column in ciphertext: MD5 of account_id, result_type, plan_id, cell_id, site_id, page_id, third_result_id, result_date concatenated',
     exposal_person_count INT COMMENT 'exposed person count',
     detail_person_count INT COMMENT 'product detail page visitor count',
     order_person_count INT COMMENT 'order-placing person count',
     reserve_person_count INT COMMENT 'reservation person count (customer leads)',
     write_off_person_count INT COMMENT 'in-store redemption person count',
     page_time INT COMMENT 'page dwell time (unit: seconds)',
     install INT COMMENT 'installs',
     first_download INT COMMENT 'new downloads',
     redownload INT COMMENT 're-downloads',
     ext STRING COMMENT 'extended information',
     create_time TIMESTAMP COMMENT 'creation time',
     update_time TIMESTAMP COMMENT 'modification time')
   USING hudi
   PARTITIONED BY (region INT COMMENT 'data warehouse partition field (yyyyMMdd)')
   COMMENT 'promotion attribution result information'
   TBLPROPERTIES (
     'hoodie.table.keygenerator.class' = 'org.apache.hudi.keygen.ComplexKeyGenerator',
     'hoodie.cleaner.fileversions.retained' = '1',
     'hoodie.cleaner.policy' = 'KEEP_LATEST_FILE_VERSIONS',
     'hoodie.metadata.enable' = 'false',
     'hoodie.schema.on.read.enable' = 'true',
     'preCombineField' = 'kafka_offset',
     'primaryKey' = 'result_id',
     'type' = 'cow')
   2. Cluster data migration:
   hadoop distcp hdfs://old-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result/* hdfs://new-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result
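   A quick way to verify the copy, assuming hdfs CLI access to both clusters (a hedged sketch, not one of the steps above), is to compare directory/file/byte counts on both sides:
   # counts should match between the two clusters for the copied data files
   hdfs dfs -count hdfs://old-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result
   hdfs dfs -count hdfs://new-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result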
   3. On the new cluster, parse new data into the lake:
   spark-submit --name offline_ods_analy_3001_promote --master yarn \
     --deploy-mode client --num-executors 4 --executor-cores 2 --executor-memory 3g \
     --driver-memory 1g --conf spark.executor.memoryOverhead=1g \
     --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY \
     --conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY \
     --conf spark.sql.avro.datetimeRebaseModeInWrite=LEGACY \
     --conf spark.sql.avro.datetimeRebaseModeInRead=LEGACY \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --conf 'spark.sql.legacy.timeParserPolicy=LEGACY' \
     /tmp/yinjian/zmn_binlog_analysis2.py 3001 1 0 0 2024081617
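   A minimal read sanity check (a hedged sketch, assuming the same catalog/extension configs as the job above) to confirm the migrated table is queryable before the parse job runs:
   # read-only check; fails fast if the table metadata on the new cluster is broken
   spark-sql \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     -e 'SELECT COUNT(*) FROM ods.ods_mct_track_result'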
   4. Parsing new data into the lake on the new cluster fails. The historical data (about 2 billion rows) must not be rewritten; for that reason the contents of the old cluster's .hoodie directory were not copied, and the table still references the .hoodie directory created on the new cluster.
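   Hudi records the key generator class, record key, precombine field, and table version in .hoodie/hoodie.properties, so one hedged diagnostic sketch is to diff the old table's file against the one generated by the new CREATE TABLE; a mismatch there (for example a different keygenerator or table version between 0.12.0 and 0.14.0) is a likely source of the schema inconsistency:
   # pull both property files locally and diff them (assumes both namenodes are reachable)
   hdfs dfs -cat hdfs://old-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result/.hoodie/hoodie.properties > old.properties
   hdfs dfs -cat hdfs://new-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result/.hoodie/hoodie.properties > new.properties
   diff <(sort old.properties) <(sort new.properties)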
   5. Expectations: the Hudi table structure should stay consistent, so that no schema inconsistency occurs when parsed data is written on the new cluster, or there should be a more efficient way to migrate Hudi tables between clusters without rewriting the data.
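   One possible approach, sketched under the assumption that the whole table directory (including the .hoodie timeline) can be copied as-is: distcp the full path, then register the existing location with a LOCATION-only CREATE TABLE, so Hudi reads the schema and table config from the copied hoodie.properties and no data files are rewritten:
   # copy the entire table directory, .hoodie included (-p preserves file attributes)
   hadoop distcp -p hdfs://old-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result hdfs://new-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result
   # register the existing Hudi path instead of defining the schema again
   spark-sql -e "CREATE TABLE ods.ods_mct_track_result USING hudi LOCATION 'hdfs://new-cluster:8020/user/hive/warehouse/ods.db/ods_mct_track_result'"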
   

