zjureel opened a new pull request, #396: URL: https://github.com/apache/flink-table-store/pull/396
Currently, the table store uses the latest schema id to read data file metadata. When the schema evolves, this causes errors. For example, suppose the underlying data was written with schema id 0, whose fields are [1->a, 2->b, 3->c, 4->d], where 1/2/3/4 are field ids and a/b/c/d are field names. After schema evolution, the schema id is 1 and the new schema is [1->a, 3->c, 5->f, 6->b, 7->g]. When the table store reads the underlying data file, it should use schema 0 with fields [1->a, 2->b, 3->c, 4->d], and map schema 1 to schema 0 according to their field ids.

This PR reads the data from the avro/orc/parquet data file according to the file's schema id, then creates an index mapping between the table schema and the underlying data schema, so that the table store can read the correct row data through its latest schema.

The main code changes are as follows:
1. Added method `valueFields` in `KeyValueFieldsExtractor` to extract fields from `TableSchema`
2. Added `AbstractFileRecordIterator` for `KeyValueDataFileRecordIterator` and `RowDataFileRecordIterator` to create projected row data from the table schema to the underlying row data
3. Added methods in `SchemaEvolutionUtil` to create an index mapping between schemas and to convert the projection from the table schema to the underlying data schema
4. Added `BulkFormatMapping` to create the reader factory and index mapping for `KeyValueFileReaderFactory` and `AppendOnlyFileStoreRead`

The main tests include:
1. Updated `SchemaEvolutionUtilTest` to cover creating the index mapping and converting the projection
2. Added `AppendOnlyFileDataTableTest`, `ChangelogValueCountFileDataTableTest` and `ChangelogWithKeyFileDataTableTest` to read and filter data after schema evolution
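To illustrate the index-mapping idea on the schemas above, here is a minimal, self-contained sketch (not the actual `SchemaEvolutionUtil` code; class and method names are hypothetical). It maps each position in the latest table schema to the position of the same field id in the data file's schema, using -1 for fields that did not exist when the file was written (the reader would then emit null for those positions):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IndexMappingSketch {
    /**
     * Build an index mapping from table-schema positions to data-schema
     * positions by field id. tableFieldIds[i] is the id of the i-th field in
     * the latest table schema; dataFieldIds[j] is the id of the j-th field in
     * the schema the file was written with. mapping[i] = j when the ids match,
     * or -1 when the field is new and absent from the data file.
     */
    public static int[] createIndexMapping(int[] tableFieldIds, int[] dataFieldIds) {
        Map<Integer, Integer> idToDataIndex = new HashMap<>();
        for (int j = 0; j < dataFieldIds.length; j++) {
            idToDataIndex.put(dataFieldIds[j], j);
        }
        int[] mapping = new int[tableFieldIds.length];
        for (int i = 0; i < tableFieldIds.length; i++) {
            mapping[i] = idToDataIndex.getOrDefault(tableFieldIds[i], -1);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Data file written with schema 0: field ids [1, 2, 3, 4] (a, b, c, d).
        int[] dataFieldIds = {1, 2, 3, 4};
        // Latest table schema 1: field ids [1, 3, 5, 6, 7] (a, c, f, b, g).
        // Note field id 6 is a *new* field reusing the name "b"; by id it does
        // not match the old "b" (id 2), so it maps to -1 and reads as null.
        int[] tableFieldIds = {1, 3, 5, 6, 7};
        System.out.println(Arrays.toString(createIndexMapping(tableFieldIds, dataFieldIds)));
        // prints [0, 2, -1, -1, -1]
    }
}
```

Matching by field id rather than by name is what makes the mapping safe: a dropped-and-re-added field name gets a fresh id, so old file data is never misread into the new column.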
