[I] Streaming writes to a `Partial Update` table, data appears to be lost [incubator-paimon]

via GitHub Sat, 07 Oct 2023 01:03:22 -0700


hililiwei opened a new issue, #2091:
URL: https://github.com/apache/incubator-paimon/issues/2091


   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/incubator-paimon/issues) and found nothing similar.
   
   
   ### Paimon version
   
   0.5
   
   ### Compute Engine
   
   Flink 1.14
   Spark 3.1
   
   
   ### Minimal reproduce step
   
   table:
   ```sql
        CREATE TABLE `paimon_partial`.`adsoaidrcm`.ads_rcm_ad_feature_reflow_origin_hm_partial(
                req_id string ,
                slot_id string ,
                slot_seq string ,
                task_id string ,
                creative_id string ,
                features string ,
                others Map<String, String>,
                feature_set_name string,
                event_time bigint,
                show_label int,
                click_label int,
                pt_h string,
                req_pt_h string,
                PRIMARY KEY (req_id,slot_id,slot_seq,task_id,creative_id,pt_h) NOT ENFORCED
        )
        PARTITIONED BY (pt_h)
        WITH (
                'merge-engine'='partial-update',
                'bucket'='4000',
                'bucket-key'='req_id',
                'path'='obs://xxxxxxxxxxx/ads_rcm_ad_feature_reflow_origin_hm_partial'
        );
   ```
   
   Flink job:
   
   ```sql
   insert into `paimon_partial`.`adsoaidrcm`.ads_rcm_ad_feature_reflow_origin_hm_partial
   /*+ OPTIONS(
       'sink.parallelism'='1400',
       'sink.use-managed-memory-allocator'='true',
       'sink.managed.writer-buffer-memory'='1G',
       'manifest.target-file-size'='64M',
       'num-sorted-run.stop-trigger'='2147483647',
       'sort-spill-threshold'='10',
       'num-sorted-run.compaction-trigger'='10',
       'snapshot.time-retained'='3650d') */
   SELECT req_id,slot_id,slot_seq,task_id,creative_id,features,
          cast(null as Map<String, String>) as others,
          feature_set_name,log_time as event_time,
          cast(null as int) as show_label, cast(null as int) as click_label,
          DATE_FORMAT(TO_TIMESTAMP(FROM_UNIXTIME(log_time / 1000)), 'yyyyMMddHH') as pt_h
   FROM `hive`.`adsoaidrcm`.rcm_ad_feature_structured_data_log
   /*+ OPTIONS('source.parallelism'='600','properties.group.id'='rcm_ad_feature_structured_data_log_partial') */
   WHERE feature_set_name='pctr_third_party_fusion_v8_collections'
   UNION ALL
   SELECT req_id,slot_id,slot_seq,task_id,creative_id,
          cast(null as string) as features, other_fields,
          cast(null as string) as feature_set_name,
          cast(null as bigint) as event_time,
          0 as show_label,cast(null as int) as click_label,
          DATE_FORMAT(cast(other_fields['record_time'] as timestamp), 'yyyyMMddHH') as pt_h
   FROM `hive`.`adsoaidrcm`.rcm_ad_feature_reflow_user_show_log
   /*+ OPTIONS('source.parallelism'='200','properties.group.id'='rcm_ad_feature_reflow_user_show_log_partial') */
   WHERE aid is not null and aid <> '' and is_valid='1'
   UNION ALL
   SELECT req_id,slot_id,slot_seq,task_id,creative_id,
          cast(null as string) as features,cast(null as Map<String, String>),
          cast(null as string) as feature_set_name,
          cast(null as bigint) as event_time,
          cast(null as int) as show_label,  1 as click_label,
          DATE_FORMAT(cast(other_fields['record_time'] as timestamp), 'yyyyMMddHH') as pt_h
   FROM `hive`.`adsoaidrcm`.rcm_ad_feature_reflow_user_click_log
   /*+ OPTIONS('source.parallelism'='35','properties.group.id'='rcm_ad_feature_reflow_user_click_log_partial') */
   WHERE aid is not null and aid <> '' and is_valid='1';
   ```
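
As a reading aid for the job above, here is a minimal Python sketch of the merge behaviour the three `UNION ALL` branches rely on. It assumes a simplified partial-update rule (for rows sharing a primary key, a non-null field overwrites the stored value and a null field leaves it alone); it is not Paimon's implementation and not a diagnosis of this issue. The sample key values and the `make_row` helper are hypothetical. Because `pt_h` is part of the primary key in the table definition, the sketch also shows that rows with different `pt_h` values stay separate.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Primary-key columns taken from the table definition in this issue.
KEY_COLS = ("req_id", "slot_id", "slot_seq", "task_id", "creative_id", "pt_h")

Row = Dict[str, Optional[Any]]


def partial_update(rows: Iterable[Row]) -> Dict[Tuple, Row]:
    """Merge rows by primary key (simplified, assumed rule):
    a non-null value overwrites the stored one, a null never does."""
    merged: Dict[Tuple, Row] = {}
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        target = merged.setdefault(key, {})
        for col, value in row.items():
            # Keep the stored value when the incoming field is null.
            if value is not None or col not in target:
                target[col] = value
    return merged


def make_row(pt_h: str, **fields: Any) -> Row:
    """Hypothetical helper: one row with fixed sample key values."""
    base: Row = {
        "req_id": "r1", "slot_id": "s1", "slot_seq": "1",
        "task_id": "t1", "creative_id": "c1", "pt_h": pt_h,
        "features": None, "others": None,
        "show_label": None, "click_label": None,
    }
    base.update(fields)
    return base


# One record per branch of the job, all in the same hour partition.
feature = make_row("2023100710", features="f")
show = make_row("2023100710", others={"aid": "a"}, show_label=0)
click = make_row("2023100710", click_label=1)
same_hour = partial_update([feature, show, click])

# Same logical event, but the click record falls into the next hour.
late_click = make_row("2023100711", click_label=1)
split = partial_update([feature, show, late_click])
```

Under these assumptions `same_hour` holds a single merged row, while `split` holds two rows, one of which has `click_label` set but `others` still null.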
   
   Spark sql:
   ```sql
   select count(1)
   from iceberg_partial.adsoaidrcm.ads_rcm_ad_feature_reflow_origin_hm_partial
   where others['aid'] is not null and others['aid'] <> '' and others['isvalid']='1'
     and (click_label is not null or show_label is not null)
     and pt_h='2023100710'
   ```
   
   ### What doesn't meet your expectations?
   
   The Spark query above returns fewer rows than expected, so some of the data written by the streaming job appears to be lost.
   
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

