zjureel opened a new pull request, #376: URL: https://github.com/apache/flink-table-store/pull/376
Currently, the table store uses the latest schema id to read the data file meta. When the schema evolves, this causes errors. For example:

1. The schema of the underlying data is [1->a, 2->b, 3->c, 4->d] with schema id 0, where 1/2/3/4 are field ids and a/b/c/d are field names.
2. After schema evolution, the schema id is 1 and the new schema is [1->a, 3->c, 5->f, 6->b, 7->g].

When the table store reads the field stats from the data file meta, it should map schema 1 to schema 0 according to their field ids.

This PR reads and parses the data according to the schema id recorded in the meta file when reading the data file meta, and creates an index mapping from the table schema to the meta schema, so that the table store can read the correct file meta data through its latest schema.

The main changes are as follows:

1. Added `SchemaFieldTypeExtractor` to extract key fields for `ChangelogValueCountFileStoreTable` and `ChangelogWithKeyFileStoreTable`.
2. Added `SchemaEvolutionUtil` to create the index mapping from the table schema to the meta file schema.
3. Updated `FieldStatsArraySerializer` to read field stats with a given index mapping.

The main tests include:

1. Added `SchemaEvolutionUtilTest` to test creating the index mapping between two schemas.
2. Added `FieldStatsArraySerializerTest` to test reading meta through the table schema.
3. Added `AppendOnlyTableFileMetaFilterTest`, `ChangelogValueCountFileMetaFilterTest` and `ChangelogWithKeyFileMetaFilterTest` to test filtering on old fields, new fields, partition fields and primary keys in data file meta during table scan.
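
To illustrate the field-id mapping described above, here is a minimal sketch using the two example schemas. It is not the code in this PR: the class `IndexMappingSketch`, the method signature, and the `-1` marker for a table field that is missing from the data file's schema are all assumptions made for illustration, and the real `SchemaEvolutionUtil` may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class IndexMappingSketch {

    // Assumed marker: the table field does not exist in the data file's schema.
    static final int MISSING = -1;

    /**
     * For each field of the table (latest) schema, find the position of the
     * field with the same id in the data file's schema, or MISSING.
     */
    static int[] createIndexMapping(int[] tableFieldIds, int[] dataFieldIds) {
        // Field id -> position in the data file's schema.
        Map<Integer, Integer> idToDataIndex = new HashMap<>();
        for (int i = 0; i < dataFieldIds.length; i++) {
            idToDataIndex.put(dataFieldIds[i], i);
        }
        int[] mapping = new int[tableFieldIds.length];
        for (int i = 0; i < tableFieldIds.length; i++) {
            mapping[i] = idToDataIndex.getOrDefault(tableFieldIds[i], MISSING);
        }
        return mapping;
    }

    public static void main(String[] args) {
        int[] dataFieldIds = {1, 2, 3, 4};      // schema 0: [1->a, 2->b, 3->c, 4->d]
        int[] tableFieldIds = {1, 3, 5, 6, 7};  // schema 1: [1->a, 3->c, 5->f, 6->b, 7->g]
        // prints [0, 2, -1, -1, -1]
        System.out.println(Arrays.toString(createIndexMapping(tableFieldIds, dataFieldIds)));
    }
}
```

Note that field 6->b maps to `-1` even though schema 0 has a field named b: matching is by field id, and id 6 does not exist in schema 0. With such a mapping, the stats for table field `i` would be taken from position `mapping[i]` of the stats stored in the file meta, and a missing field would have no stats to read.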
