zjureel opened a new pull request, #396: URL: https://github.com/apache/flink-table-store/pull/396
Currently, the table store uses the latest schema id to read data file metadata. When the schema evolves, this causes errors. For example, suppose the underlying data was written with schema id 0, whose fields are [1->a, 2->b, 3->c, 4->d], where 1/2/3/4 are field ids and a/b/c/d are field names. After schema evolution, the schema id is 1 and the new schema is [1->a, 3->c, 5->f, 6->b, 7->g]. When the table store reads the underlying data file, it should use schema 0 with fields [1->a, 2->b, 3->c, 4->d], and map schema 1 to schema 0 according to their field ids.

This PR reads the data from the avro/orc/parquet data file according to the file's schema id, then creates an index mapping between the table schema and the underlying data schema, so that the table store can read the correct row data through its latest schema.

The main code changes are as follows:
1. Added method `valueFields` in `KeyValueFieldsExtractor` to extract fields from `TableSchema`
2. Added `AbstractFileRecordIterator` for `KeyValueDataFileRecordIterator` and `RowDataFileRecordIterator` to create projected row data from the table schema to the underlying row data
3. Added methods in `SchemaEvolutionUtil` to create an index mapping between schemas and to convert the projection from the table schema to the underlying data schema
4. Added `BulkFormatMapping` to create the reader factory and index mapping for `KeyValueFileReaderFactory` and `AppendOnlyFileStoreRead`

The main tests include:
1. Updated `SchemaEvolutionUtilTest` to cover creating the index mapping and converting the projection
2. Added `AppendOnlyFileDataTableTest`, `ChangelogValueCountFileDataTableTest` and `ChangelogWithKeyFileDataTableTest` to read and filter data after schema evolution
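To illustrate the index-mapping idea on the schemas above, here is a minimal, self-contained sketch (not the actual `SchemaEvolutionUtil` code; class and method names are hypothetical). It maps each position in the latest table schema to the position of the same field id in the data file's schema, using -1 for fields that did not exist when the file was written (the reader would then emit null for those positions):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IndexMappingSketch {
    /**
     * Build an index mapping from table-schema positions to data-schema
     * positions by field id. tableFieldIds[i] is the id of the i-th field in
     * the latest table schema; dataFieldIds[j] is the id of the j-th field in
     * the schema the file was written with. mapping[i] = j when the ids match,
     * or -1 when the field is new and absent from the data file.
     */
    public static int[] createIndexMapping(int[] tableFieldIds, int[] dataFieldIds) {
        Map<Integer, Integer> idToDataIndex = new HashMap<>();
        for (int j = 0; j < dataFieldIds.length; j++) {
            idToDataIndex.put(dataFieldIds[j], j);
        }
        int[] mapping = new int[tableFieldIds.length];
        for (int i = 0; i < tableFieldIds.length; i++) {
            mapping[i] = idToDataIndex.getOrDefault(tableFieldIds[i], -1);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Data file written with schema 0: field ids [1, 2, 3, 4] (a, b, c, d).
        int[] dataFieldIds = {1, 2, 3, 4};
        // Latest table schema 1: field ids [1, 3, 5, 6, 7] (a, c, f, b, g).
        // Note field id 6 is a *new* field reusing the name "b"; by id it does
        // not match the old "b" (id 2), so it maps to -1 and reads as null.
        int[] tableFieldIds = {1, 3, 5, 6, 7};
        System.out.println(Arrays.toString(createIndexMapping(tableFieldIds, dataFieldIds)));
        // prints [0, 2, -1, -1, -1]
    }
}
```

Matching by field id rather than by name is what makes the mapping safe: a dropped-and-re-added field name gets a fresh id, so old file data is never misread into the new column.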
