[
https://issues.apache.org/jira/browse/HUDI-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-8360:
--------------------------------------
Description:
We need to ensure that we cover the following cases:
# insert a few records and validate. update the same records and validate
that the updates are reflected. repeat the updates and validate stats.
for MOR, trigger compaction and validate.
# trigger clustering on top of 1 and validate stats. a. for MOR, let's trigger
clustering both before and after compaction. ensure that no stats
are available for the replaced file groups.
# insert a few records, update them, and delete a subset of records that should
impact the min and max values. validate.
# let's add a test for async compaction and validate, i.e. some log files are
added to a new phantom file slice and stats are intact.
# let's have a test for a non-partitioned table.
# let's trigger rollbacks and validate, i.e. insert, then update (partially
failed). validate that only stats pertaining to the inserts are reflected.
trigger a rollback and validate it is still the same. retry the updates; the
stats should reflect the updated records.
# let's add one long-running test, i.e. with 20+ commits and an aggressive
cleaner and archival, just for sanity. or if we can enable all kinds of
indexes in an existing sanity test, I am good.
# let's test all write operations: bulk_insert, insert, upsert, delete,
insert_overwrite, insert_overwrite_table, delete_partition.
# add a test for a non-partitioned dataset as well (for the unmerged log
record reading flow)
The above tests are generic for any index.
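The insert/update/delete stats checks in items 1 and 3 can be sketched with a
toy in-memory model (plain Python, not Hudi APIs; keys and values are
illustrative) showing how min/max column stats should track each batch:

```python
# Toy model of column-stats validation: a table is a dict of
# primary key -> column value, and stats are the (min, max) pair
# that the index is expected to report after each write batch.

def column_stats(records):
    """Return (min, max) over the value column, or (None, None) when empty."""
    vals = list(records.values())
    return (min(vals), max(vals)) if vals else (None, None)

table = {}                        # primary key -> value

# 1. insert a few records and validate
table.update({"k1": 10, "k2": 50, "k3": 90})
assert column_stats(table) == (10, 90)

# 2. update the same records and validate the updates are reflected
table["k3"] = 70                  # current max moves down
assert column_stats(table) == (10, 70)

# 3. delete a subset that should impact the min value, then validate
del table["k1"]                   # removes the current min
assert column_stats(table) == (50, 70)
```

The real tests would run the same sequence through Hudi writes and read the
stats back from the metadata table instead of computing them locally.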
For testing the secondary index per se, let's try to cover the scenarios below.
a. add 100 entries w/ a mix of secondary index values. then:
update 1 of them to another sec index value that already exists.
update 1 of them to a new sec index value.
delete one of the records whose sec index value is referenced by other
primary keys.
delete one of the records whose secondary index value is not referenced by
any other primary key.
b. add 100 entries w/ a mix of secondary index values.
update one subset of records for a given sec index value to another.
delete one subset of records for a given sec index value.
insert new records with a new sec index value.
c. add 100 entries w/ a mix of secondary index values.
update one subset of records for a given sec index value to another.
delete one subset of records for a given sec index value.
insert new records overlapping w/ the secondary index value which got updated
and deleted in this batch.
d. add 100 entries w/ just 1 sec index value.
update a subset to N different sec index values.
delete a subset of them.
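The expected secondary-index behavior that scenarios a-d exercise can be
sketched as a toy in-memory mapping (plain Python, not Hudi code; key and
value names are illustrative): sec value -> set of primary keys, kept in sync
on every upsert and delete.

```python
# Toy model of a secondary index: records maps primary key -> sec value,
# sec_index maps sec value -> set of primary keys referencing it.
from collections import defaultdict

records = {}                      # primary key -> sec value
sec_index = defaultdict(set)      # sec value -> {primary keys}

def upsert(pk, sec):
    """Insert or update a record, unlinking any previous sec index entry."""
    old = records.get(pk)
    if old is not None:
        sec_index[old].discard(pk)
        if not sec_index[old]:
            del sec_index[old]    # old value no longer referenced by any key
    records[pk] = sec
    sec_index[sec].add(pk)

def delete(pk):
    """Delete a record, dropping its sec value when nothing references it."""
    sec = records.pop(pk)
    sec_index[sec].discard(pk)
    if not sec_index[sec]:
        del sec_index[sec]

# scenario a: 100 entries over a mix of sec values, then targeted changes
for i in range(100):
    upsert(f"pk{i}", f"sec{i % 5}")
upsert("pk0", "sec1")             # move to an existing sec value
upsert("pk1", "sec_new")          # move to a brand-new sec value
delete("pk2")                     # sec2 is still referenced by other keys
assert "sec2" in sec_index
delete("pk1")                     # sec_new had no other referencing key
assert "sec_new" not in sec_index
```

The functional tests would assert the same invariants by querying the
secondary index after each Hudi commit rather than mutating local dicts.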
On the validation front:
Ensure we read back all records and match the entire expected list rather than
doing just a single-entry lookup.
Ensure MDT compaction kicks in and repeat the validation.
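The "match the entire expected list" check can be sketched as (plain Python;
function and variable names are illustrative, not an existing test helper):

```python
# Full read-back validation: compare the complete set of
# (primary key, sec value) pairs, not a single-key lookup.

def validate_full(actual_rows, expected_rows):
    """Order-insensitive comparison of the entire result set."""
    assert sorted(actual_rows) == sorted(expected_rows), (
        f"mismatch: {sorted(actual_rows)} != {sorted(expected_rows)}")

expected = [("pk1", "a"), ("pk2", "b"), ("pk3", "a")]
# passes regardless of read order
validate_full([("pk3", "a"), ("pk1", "a"), ("pk2", "b")], expected)
```

Running this same comparison again after MDT compaction covers the second
validation point above.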
> Add functional test for Secondary Index
> ---------------------------------------
>
> Key: HUDI-8360
> URL: https://issues.apache.org/jira/browse/HUDI-8360
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Blocker
> Fix For: 1.0.0
>
> Original Estimate: 10h
> Remaining Estimate: 10h
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)