dexty007 opened a new issue, #9877: URL: https://github.com/apache/seatunnel/issues/9877
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

### Critical Bug: Non-deterministic schema generation in multi-table Spark pipelines causes silent data corruption

### Root Cause:

MultiTableManager.mergeSchema() uses table iteration order to assign global column indices, making schema generation non-deterministic when the same tables are processed in different orders.

### Algorithm Issue:

**// MultiTableManager.java:148-164 mergeSchema() method**

```java
for (int i = 0; i < catalogTables.length; i++) {
    // Processing order determines global field positions
    if (!indexQueue.hasNext()) {
        indexSize++; // Sequential assignment based on iteration order
        fieldNames.add(editColumnName(indexSize));
        fieldTypes.add(seaTunnelDataTypes[j]);
    }
}
```

### Failure Scenario:

1. Execution 1: [TableA(INT,STRING), TableB(STRING,LONG)] → schema: [INT, STRING, LONG]
2. Execution 2: [TableB(STRING,LONG), TableA(INT,STRING)] → schema: [STRING, LONG, INT]
3. Result: Same data encoded with different schemas → silent data corruption

### Impact:

- **Silent data corruption**: Wrong data types in wrong positions
- **Non-deterministic behavior**: Same pipeline produces different results
- **Production risk**: Financial/business data corruption without error indication
- **Debugging difficulty**: Appears as mysterious data inconsistencies

### Evidence:

Current tests only cover identical schemas where order changes are invisible (MultiTableManagerTest.java:105-106), masking this critical bug for heterogeneous schemas.

### Expected Behavior:

Same set of tables should always produce identical merged schema regardless of processing order.
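
The failure scenario can be reproduced in isolation with a small model of an order-dependent merge. The sketch below is not the SeaTunnel implementation: `MergeOrderDemo`, `Table`, `merge`, and `mergeSorted` are hypothetical names, and the slot-reuse rule (reuse the first same-typed column the current table has not claimed, otherwise append) is an assumption chosen only because it reproduces the two schemas listed in the failure scenario.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, self-contained model; not taken from the SeaTunnel code base.
final class MergeOrderDemo {

    /** Stand-in for a catalog table: a path plus its ordered field types. */
    static final class Table {
        final String path;
        final List<String> fieldTypes;

        Table(String path, String... fieldTypes) {
            this.path = path;
            this.fieldTypes = Arrays.asList(fieldTypes);
        }
    }

    /**
     * Assumed merge rule: each field reuses the first merged column of the same
     * type that the current table has not claimed yet; otherwise a new column is
     * appended. Column positions therefore depend on table iteration order.
     */
    static List<String> merge(List<Table> tables) {
        List<String> merged = new ArrayList<>();
        for (Table table : tables) {
            // Large enough for every column that can exist while this table is processed.
            boolean[] used = new boolean[merged.size() + table.fieldTypes.size()];
            for (String type : table.fieldTypes) {
                int slot = -1;
                for (int i = 0; i < merged.size(); i++) {
                    if (!used[i] && merged.get(i).equals(type)) {
                        slot = i;
                        break;
                    }
                }
                if (slot < 0) {
                    merged.add(type);
                    slot = merged.size() - 1;
                }
                used[slot] = true;
            }
        }
        return merged;
    }

    /** Same merge over a copy sorted by table path, so input order no longer matters. */
    static List<String> mergeSorted(List<Table> tables) {
        List<Table> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparing((Table t) -> t.path));
        return merge(sorted);
    }

    public static void main(String[] args) {
        Table a = new Table("db.table_a", "INT", "STRING");
        Table b = new Table("db.table_b", "STRING", "LONG");

        System.out.println(merge(Arrays.asList(a, b)));       // [INT, STRING, LONG]
        System.out.println(merge(Arrays.asList(b, a)));       // [STRING, LONG, INT]
        System.out.println(mergeSorted(Arrays.asList(a, b))); // [INT, STRING, LONG]
        System.out.println(mergeSorted(Arrays.asList(b, a))); // [INT, STRING, LONG]
    }
}
```

In this model, sorting a copy by a stable key is enough to make the merged schema independent of input order. Whether the real `mergeSchema()` can sort in place or must preserve the caller's array order would need to be checked against the actual code.
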
### Suggested Fix:

Make schema generation order-independent:

```java
// Sort tables by deterministic identifier before processing
Arrays.sort(catalogTables, Comparator.comparing(t -> t.getTablePath().toString()));
```

Priority: P0 - Silent data corruption in core multi-table functionality

### SeaTunnel Version

2.3.x (affects all versions with multi-table support)

### SeaTunnel Config

```conf
env {
  execution.parallelism = 1
}

source {
  FakeSource {
    tables_configs = [
      {
        schema = {
          table = "users"
          fields {
            id = int
            name = string
          }
        }
      },
      {
        schema = {
          table = "orders"
          fields {
            description = string
            amount = bigint
          }
        }
      }
    ]
  }
}

transform {
  FieldMapper {
    field_mapper = {
      id = user_id
    }
  }
}

sink {
  Console {}
}
```

### Running Command

```shell
./bin/seatunnel.sh --config config/multi-table-heterogeneous.conf --engine spark
```

### Error Exception

```log
No explicit exception - silent data corruption occurs.
Data appears in wrong columns due to schema mismatch between encoding and decoding stages.
```

### Zeta or Flink or Spark Version

Spark 3.x (affects all Spark versions)

### Java or Scala Version

_No response_

### Screenshots

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
