[
https://issues.apache.org/jira/browse/HUDI-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chao he updated HUDI-5413:
--------------------------
Description:
Previously, PV/UV was computed with Flink plus window aggregation. That approach
carries the risk of discarding late data and of state explosion; a record count
payload avoids both risks.
To use 'RecordCountAvroPayload', add a field [hoodie_record_count bigint] to the
schema when creating the Hudi table to hold the PV/UV result. The
'hoodie_record_count' field does not need to be populated: Flink automatically
sets it to "null", and "null" is treated as 1.
e.g. the ordering field is 'ts' and the schema is:
[
  {"name":"id","type":"string"},
  {"name":"ts","type":"long"},
  {"name":"name","type":"string"},
  {"name":"hoodie_record_count","type":"long"}
]
Case 1
Current data:
  id  ts  name    hoodie_record_count
  1   1   name_1  1
Insert data:
  id  ts  name    hoodie_record_count
  1   2   name_2  2
Result data:
  id  ts  name    hoodie_record_count
  1   2   name_2  3
Case 2
Current data:
  id  ts  name    hoodie_record_count
  1   2   name_1  null
Insert data:
  id  ts  name    hoodie_record_count
  1   1   name_2  1
Result data:
  id  ts  name    hoodie_record_count
  1   2   name_1  2
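The merge rule implied by the two cases above can be sketched in plain Java. This is a hypothetical, simplified illustration, not the actual Hudi payload implementation: the class name, the use of maps instead of Avro records, and the tie-breaking choice for equal 'ts' values are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the record-count merge semantics described above:
// the surviving record is chosen by the ordering field 'ts', and the
// hoodie_record_count values are summed, with null counted as 1.
public class RecordCountMergeSketch {

    // A missing/null hoodie_record_count represents a single record.
    static long countOf(Object value) {
        return value == null ? 1L : (Long) value;
    }

    // Merge an incoming record into the current stored record.
    static Map<String, Object> merge(Map<String, Object> current,
                                     Map<String, Object> incoming) {
        long total = countOf(current.get("hoodie_record_count"))
                   + countOf(incoming.get("hoodie_record_count"));
        // Keep the record with the larger ordering value 'ts';
        // letting the incoming record win on a tie is an assumption here.
        Map<String, Object> winner =
            (Long) incoming.get("ts") >= (Long) current.get("ts")
                ? incoming : current;
        Map<String, Object> result = new HashMap<>(winner);
        result.put("hoodie_record_count", total);
        return result;
    }
}
```

With the data from Case 2, the current record (ts=2) wins and the counts sum to null(=1) + 1 = 2, matching the result table above.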
was:
In the past, pv/uv was processed through flink + window aggregation. This
method has the risk of delayed data discarding and state explosion. We use
record count payload without these risks.
In order to use 'RecordCountAvroPayload', we need to add field
[hoodie_record_count bigint] to the schema when creating the hudi table to
record the result of pv/uv.
eg:
Order field is 'ts', schema is :
[
  {"name":"id","type":"string"},
  {"name":"ts","type":"long"},
  {"name":"name","type":"string"},
  {"name":"hoodie_record_count","type":"long"}
]
case 1
Current data:
id ts name hoodie_record_count
1 1 name_1 1
Insert data:
id ts name hoodie_record_count
1 2 name_2 2
Result data:
id ts name hoodie_record_count
1 2 name_2 3
case 2
Current data:
id ts name hoodie_record_count
1 2 name_1 null
Insert data:
id ts name hoodie_record_count
1 1 name_2 1
Result data:
id ts name hoodie_record_count
1 2 name_1 2
> Add record count payload to support pv/uv
> -----------------------------------------
>
> Key: HUDI-5413
> URL: https://issues.apache.org/jira/browse/HUDI-5413
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: chao he
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)