[ 
https://issues.apache.org/jira/browse/HUDI-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882772#comment-17882772
 ] 

Lokesh Jain edited comment on HUDI-8208 at 9/18/24 5:46 PM:
------------------------------------------------------------

The fix requires that partition stats be regenerated during clustering, 
compaction or replace commit operations. The records for such operations need 
to be written with an overwrite flag so that older partition stat records are 
discarded. There are two approaches that can be considered here.
1. We include file slice stats within partition stats. The file slice stats are 
updated from the commit metadata stats and are then used to calculate the 
aggregate stats for the partition during replace commit or compaction 
operations.

2. We always enable column stats along with partition stats. This can be done 
internally. We can then query column stats to generate the aggregate partition 
stats for clustering, compaction or replace commit operations.
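
A minimal sketch of approach 1, to make the record shape concrete. It is not 
Hudi code: the PartitionStatRecord class, the overwrite field, the 
apply_compaction helper and the (min, max) integer range are all hypothetical 
names and simplifications. The idea shown is only that the partition record 
keeps one range per file group, so a compaction can replace that entry and 
re-aggregate without reading anything else.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Range = Tuple[int, int]  # (min, max) of a single column; simplified on purpose


@dataclass
class PartitionStatRecord:
    """Hypothetical partition stat record that also carries per-file-group ranges."""
    partition_path: str
    file_slice_ranges: Dict[str, Range] = field(default_factory=dict)
    overwrite: bool = False  # assumed flag: older records for this key are discarded

    def aggregate(self) -> Range:
        """Aggregate range over the file slices tracked in this record."""
        mins = [lo for lo, _ in self.file_slice_ranges.values()]
        maxs = [hi for _, hi in self.file_slice_ranges.values()]
        return (min(mins), max(maxs))


def apply_compaction(record: PartitionStatRecord,
                     file_group_id: str,
                     new_range: Range) -> PartitionStatRecord:
    """Return a new record with one file group's range replaced, flagged to overwrite."""
    updated = dict(record.file_slice_ranges)
    updated[file_group_id] = new_range  # the new file slice supersedes the old one
    return PartitionStatRecord(record.partition_path, updated, overwrite=True)
```

Keying the ranges by file group rather than by file slice is a design choice of 
this sketch; it makes the newer slice replace the older one by construction.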

Pros/Cons
1. Approach 1 requires querying only partition stats. The partition stats 
partition would hold fewer records, equal to the number of partitions in the 
table. Also, partition stats would not depend on column stats and could be 
enabled independently with this approach.

2. Approach 2 allows reusing the existing column stats to find the stats for 
all file slices of a partition. It also avoids the additional metadata and 
logic needed to maintain these file stats in partition stats.

3. It would also be better to merge these two indexes into a single index, 
ideally the column stats index. We can store the partition stats using the 
column and partition path as the key. It is better to go with approach 2 so 
that it is easier to integrate these indexes later.

4. Approach 1 requires querying only the partition stats and does not require 
querying the files partition. Column stats, however, would need to be queried 
using the latest file slice info from the files partition. This is not a major 
flaw, since column stats can be improved to avoid querying the files partition. 
It would also be better to improve the column stats partition so that at any 
time it holds only the latest file slice stats for a particular partition.
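
To illustrate point 3, here is a rough sketch of how partition-level and 
file-level stats could share one key space. This is an assumption for 
illustration only and does not reflect how Hudi actually encodes column stats 
keys; the separator, stats_key and is_partition_level are hypothetical.

```python
from typing import Optional

SEP = "\x00"  # hypothetical separator, assumed never to occur in names or paths


def stats_key(column: str, partition_path: str,
              file_name: Optional[str] = None) -> str:
    """Build a key: (column, partition) for partition stats,
    (column, partition, file) for per-file column stats."""
    parts = [column, partition_path]
    if file_name is not None:
        parts.append(file_name)
    return SEP.join(parts)


def is_partition_level(key: str) -> bool:
    """A key with exactly two components is a partition-level record."""
    return key.count(SEP) == 1
```

With such a layout, the partition-level key is a prefix of every file-level key 
for the same column and partition, so both kinds of record could be found with 
one prefix lookup.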


> Fix partition stats with compaction or clustering
> -------------------------------------------------
>
>                 Key: HUDI-8208
>                 URL: https://issues.apache.org/jira/browse/HUDI-8208
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: Lokesh Jain
>            Assignee: Lokesh Jain
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> Consider a partition with 10 file slices. If compaction is triggered for 1 
> file slice, producing fs1_1, the partition stats are updated for that file 
> slice with the same key (partition path). The older partition stat record for 
> that partition path would account for the other 9 file slices (fs2_0 - fs10_0) 
> + the older stat (fs1_0). The final read value would be a merge of all 
> versions of the file slices (fs2_0 - fs10_0, fs1_0, fs1_1). It should only 
> account for the latest version of fs1.
> Upon compaction or clustering, the partition stats should be recomputed and 
> the older records for that partition should be invalidated.
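
The scenario in the description can be reproduced with a small worked example. 
The (min, max) values below are invented for illustration, and the merge and 
aggregate helpers are hypothetical stand-ins for the metadata merge logic, not 
Hudi APIs. The example contrasts merging the new file slice into the old 
partition record with recomputing from the latest file slices only.

```python
from typing import Dict, Optional, Tuple

Range = Tuple[int, int]  # (min, max) of one column


def merge(a: Range, b: Range) -> Range:
    """Merge two ranges the way min/max stats are combined on read."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def aggregate(ranges: Dict[str, Range]) -> Range:
    """Aggregate the ranges of all given file slices."""
    result: Optional[Range] = None
    for r in ranges.values():
        result = r if result is None else merge(result, r)
    if result is None:
        raise ValueError("no file slices to aggregate")
    return result


# Before compaction: ten file slices, fs1_0 .. fs10_0 (values are made up).
before = {"fs1_0": (0, 100)}
before.update({f"fs{i}_0": (10, 40) for i in range(2, 11)})
old_partition_stat = aggregate(before)

# Compaction rewrites fs1_0 into fs1_1 with a narrower range.
fs1_1 = (10, 50)

# Behaviour described in the issue: the new slice is merged into the old record,
# so the range of the superseded fs1_0 is still counted.
stale = merge(old_partition_stat, fs1_1)

# Expected behaviour: recompute from the latest file slices only.
after = {name: r for name, r in before.items() if name != "fs1_0"}
after["fs1_1"] = fs1_1
recomputed = aggregate(after)
```

Here stale stays at (0, 100) because the old record still includes fs1_0, while 
recomputed narrows to (10, 50).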



