hudi-bot opened a new issue, #17291:
URL: https://github.com/apache/hudi/issues/17291
We need to ensure that we cover the following cases for basic col stats
certification:
# Insert a few records and validate. Update the same records and validate
that the updates are reflected. Repeat the updates and validate the stats.
For MOR, trigger compaction and validate.
# For MOR, let's ensure we cover all log block types (data blocks, delete
blocks, and rollback blocks).
# Trigger clustering on top of case 1 and validate stats. a. For MOR, let's
trigger clustering before compaction and also after compaction. Ensure that no
stats are available for the replaced file groups.
# Insert a few records, update them, and delete a subset of records such that
the min and max values are affected. Validate.
# Let's add a test for async compaction and validate, i.e. some log files
are added to a new phantom file slice and stats are intact.
# Let's have a test for a non-partitioned table.
# Trigger clean and ensure that cleaned-up files are deleted from col stats.
Lookups should not even return null stats for them.
# Let's trigger rollbacks and validate, i.e. insert, then update (partially
failed). Validate that only stats pertaining to the inserts are reflected.
Trigger a rollback and validate that it's still the same. Retry the updates;
the stats should then reflect the updated records.
# Let's add one long-running test, i.e. with 20+ commits and an aggressive
cleaner and archival, just for sanity. Or, if we can enable all kinds of
indexes in an existing sanity test, we should be good.
# Let's test all write operations: bulk_insert, insert, upsert, delete,
insert_overwrite, insert_overwrite_table, delete_partition.
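The invariants behind cases 3, 4, and 7 can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyColStatsIndex`, `ColStats`, and the file names are hypothetical and are not Hudi classes or APIs. It tracks min/max per file for a single integer column and checks that the table-level range follows updates and deletes, and that cleaned or replaced files have no entry at all.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ColStats:
    """Min/max/count for one column in one file (toy stand-in, not a Hudi type)."""
    min_value: int
    max_value: int
    value_count: int


class ToyColStatsIndex:
    """Hypothetical in-memory index keyed by file name."""

    def __init__(self) -> None:
        self._stats: Dict[str, ColStats] = {}

    def put_file(self, file_name: str, values: Iterable[int]) -> None:
        # Record stats for a newly written file (assumed to be non-empty).
        vals = list(values)
        self._stats[file_name] = ColStats(min(vals), max(vals), len(vals))

    def drop_file(self, file_name: str) -> None:
        # Models clean or replacement: the entry disappears entirely.
        self._stats.pop(file_name, None)

    def lookup(self, file_name: str) -> Optional[ColStats]:
        return self._stats.get(file_name)

    def tracked_files(self) -> List[str]:
        return sorted(self._stats)

    def table_range(self) -> Optional[Tuple[int, int]]:
        # Table-level min/max derived from the per-file entries.
        if not self._stats:
            return None
        stats = self._stats.values()
        return (min(s.min_value for s in stats), max(s.max_value for s in stats))


index = ToyColStatsIndex()

# Insert three records into file group fg1.
index.put_file("fg1_c1", [5, 10, 20])
range_after_insert = index.table_range()

# Update 20 -> 25: a new file version is written, then the old one is cleaned.
index.put_file("fg1_c2", [5, 10, 25])
index.drop_file("fg1_c1")
range_after_update = index.table_range()

# Delete the records holding the min (5) and the max (25).
index.put_file("fg1_c3", [10])
index.drop_file("fg1_c2")
range_after_delete = index.table_range()

# Clustering replaces fg1 with fg2; the replaced file group keeps no stats.
index.put_file("fg2_c4", [10])
index.drop_file("fg1_c3")
```

A real test would make the same kind of assertions against the column stats partition of the metadata table instead of an in-memory dictionary.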
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8384
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-8727
- Fix version(s): 1.1.0
---
## Comments
18/Oct/24 00:17;shivnarayan;Here are the follow-ups from the auditing we did
on col stats design and pruning.
https://issues.apache.org/jira/browse/HUDI-8388
https://issues.apache.org/jira/browse/HUDI-8389
https://issues.apache.org/jira/browse/HUDI-8390
;;;
---
20/Mar/25 14:56;codope;In `TestColumnStatsIndex`, the following tests cover
the scenarios listed in the description:
testMetadataColumnStatsIndexInitializationWithUpserts,
testMetadataColumnStatsIndexCompactionWithSQL: Case 1, 4, 6
testMetadataColumnStatsIndexInitializationWithRollbacks,
testMORDeleteBlocks: Case 2
testPartitionStatsWithClustering: Case 3
testColStatsWithCleanCOW: Case 7
Case 10: we have tests for most of the write ops, except insert_overwrite and
insert_overwrite_table
In summary, we need the following tests:
Case 5: Let's add a test for async compaction and validate, i.e. some log
files are added to a new phantom file slice and stats are intact.
Case 8: Let's trigger rollbacks and validate, i.e. insert, then update
(partially failed). Validate that only stats pertaining to the inserts are
reflected. Trigger a rollback and validate that it's still the same. Retry the
updates; the stats should then reflect the updated records. Here, we can extend
`testMetadataColumnStatsIndexInitializationWithRollbacks` to do updates after
rollback.
Case 9: Let's add one long-running test, i.e. with 20+ commits and an
aggressive cleaner and archival, just for sanity. Or, if we can enable all
kinds of indexes in an existing sanity test, we should be good.
Case 10: only for insert_overwrite and insert_overwrite_table
;;;
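As a quick cross-check of the summary in the comment above, the remaining cases can be derived as a set difference. This is a small sketch, assuming the cases are numbered 1 to 10 in the order of the list in the description, and that each group of tests jointly covers the cases named on its line. Case 10 is treated as not fully covered because insert_overwrite and insert_overwrite_table are still missing.

```python
# Case numbers follow the order of the list in the issue description (assumed 1..10).
ALL_CASES = set(range(1, 11))

# (tests, fully covered cases), copied from the comment; grouping is as written there.
COVERAGE = [
    (("testMetadataColumnStatsIndexInitializationWithUpserts",
      "testMetadataColumnStatsIndexCompactionWithSQL"), {1, 4, 6}),
    (("testMetadataColumnStatsIndexInitializationWithRollbacks",
      "testMORDeleteBlocks"), {2}),
    (("testPartitionStatsWithClustering",), {3}),
    (("testColStatsWithCleanCOW",), {7}),
]

# Union of every fully covered case, then subtract from the full list.
fully_covered = set().union(*(cases for _, cases in COVERAGE))
remaining = sorted(ALL_CASES - fully_covered)

print(remaining)  # [5, 8, 9, 10]
```

The result matches the four cases listed in the summary: 5, 8, 9, and the insert_overwrite part of 10.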