DanielCarter-stack commented on PR #10480:
URL: https://github.com/apache/seatunnel/pull/10480#issuecomment-3878788683

   <!-- code-pr-reviewer -->
   <!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10480", "part": 1, 
"total": 1} -->
   ## Issue 1: pluginConfig in S3RedshiftSink not initialized, causing a NullPointerException
   
   **Location**: `S3RedshiftSink.java:32-34`
   
   ```java
   public S3RedshiftSink(ReadonlyConfig readonlyConfig) {
       hadoopConf = S3HadoopConf.buildWithReadOnlyConfig(readonlyConfig);
       // Missing: this.pluginConfig = readonlyConfig;
   }
   ```
   
   **Related Context**:
   - Parent class: `BaseFileSink.java:48` - `protected ReadonlyConfig 
pluginConfig;`
   - Caller: `S3RedshiftFactory.java:79` - `return () -> new 
S3RedshiftSink(context.getOptions());`
   - Dependent: `S3RedshiftSink.java:44` - `new 
S3RedshiftSinkAggregatedCommitter(hadoopConf, pluginConfig)`
   - Dependent: `S3RedshiftSinkAggregatedCommitter.java:51` - `this.executeSql 
= pluginConfig.get(...)`
   
   **Issue Description**:
   The `S3RedshiftSink` constructor only initializes `hadoopConf`; it never sets the `pluginConfig` field inherited from `BaseFileSink`. When `createAggregatedCommitter()` is called, it passes the still-`null` `pluginConfig` to `S3RedshiftSinkAggregatedCommitter`, so every subsequent configuration read (such as `pluginConfig.get(EXECUTE_SQL)`) throws a `NullPointerException`.
   
   **Potential Risks**:
   - **Runtime NPE**: Job crashes during commit phase
   - **Data loss**: Files already written to S3 cannot be loaded by Redshift
   - **Resource leak**: Temporary files cannot be cleaned up
   
   **Impact Scope**:
   - **Direct impact**: S3RedshiftSink completely unusable
   - **Indirect impact**: All ETL jobs using S3Redshift
   - **Impact surface**: Single Connector (Critical)
   
   **Severity**: **BLOCKER**
   
   **Improvement Suggestion**:
   ```java
   public S3RedshiftSink(ReadonlyConfig readonlyConfig) {
       this.pluginConfig = readonlyConfig;  // Add this line
       hadoopConf = S3HadoopConf.buildWithReadOnlyConfig(readonlyConfig);
   }
   ```
   
   **Rationale**: `pluginConfig` is a field of the parent class `BaseFileSink`; the subclass must initialize it during construction, otherwise any later use throws an NPE.
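   
   A minimal regression-test sketch for this fix (the option keys and values are placeholders, not the connector's actual required options; a real test must supply whatever `S3HadoopConf.buildWithReadOnlyConfig()` expects):
   ```java
   // Assumes this test sits in the same package as S3RedshiftSink.
   import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
   
   import java.util.HashMap;
   import java.util.Map;
   
   import org.apache.seatunnel.api.configuration.ReadonlyConfig;
   import org.junit.jupiter.api.Test;
   
   public class S3RedshiftSinkTest {
   
       @Test
       public void constructorShouldInitializePluginConfig() {
           // Placeholder options; substitute the connector's real required keys.
           Map<String, Object> options = new HashMap<>();
           options.put("bucket", "s3a://test-bucket");
           options.put("jdbc_url", "jdbc:redshift://localhost:5439/dev");
   
           S3RedshiftSink sink = new S3RedshiftSink(ReadonlyConfig.fromMap(options));
   
           // Before the fix this throws NullPointerException, because the
           // aggregated committer is constructed with a null pluginConfig.
           assertDoesNotThrow(sink::createAggregatedCommitter);
       }
   }
   ```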
   
   ---
   
   ## Issue 2: BaseFileSink.prepare() method deleted, breaking subclass compilation
   
   **Location**: `BaseFileSink.java:192-202` (deleted)
   
   ```java
   // Old code (deleted)
   @Override
   public void prepare(Config pluginConfig) throws PrepareFailException {
       this.pluginConfig = pluginConfig;
   }
   ```
   
   **Related Context**:
   - Subclass caller 1: `BaseHdfsFileSink.java:48` - 
`super.prepare(pluginConfig);`
   - Subclass caller 2: `CosFileSink.java:49` - `super.prepare(pluginConfig);`
   - There may be other unchecked subclasses
   
   **Issue Description**:
   The `BaseFileSink.prepare()` method has been deleted, but subclasses such as `BaseHdfsFileSink` and `CosFileSink` still call `super.prepare(pluginConfig)`. Those calls no longer resolve, so compilation fails. According to the diff, `BaseHdfsFileSink` was not touched at all, so it breaks immediately.
   
   **Potential Risks**:
   - **Compilation failure**: All Connectors inheriting from `BaseHdfsFileSink` 
(including S3Redshift) cannot compile
   - **Block release**: This is a BLOCKER-level compilation error
   
   **Impact Scope**:
   - **Direct impact**: `BaseHdfsFileSink` and all its subclasses
   - **Indirect impact**: All HDFS-based Sink Connectors (HDFS, S3, OSS, COS, 
OBS, etc.)
   - **Impact surface**: Multiple Connectors (Critical)
   
   **Severity**: **CRITICAL**
   
   **Improvement Suggestion**:
   **Option 1**: Keep `BaseFileSink.prepare()` but mark as Deprecated
   ```java
   @Deprecated
   @Override
   public void prepare(Config pluginConfig) throws PrepareFailException {
       this.pluginConfig = ReadonlyConfig.fromConfig(pluginConfig);
   }
   ```
   
   **Option 2**: Migrate all subclasses in the same change, removing their `super.prepare()` calls (a much larger effort); one subclass is sketched below
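   
   A sketch of Option 2 for one subclass (the constructor signature and the `CosConf.buildWithReadOnlyConfig` helper are assumptions; each connector must mirror whatever its own configuration builder provides):
   ```java
   // Hypothetical migrated subclass: configuration arrives through the
   // constructor instead of prepare(Config).
   public class CosFileSink extends BaseFileSink {
   
       public CosFileSink(ReadonlyConfig readonlyConfig) {
           // Replaces the former super.prepare(pluginConfig) call.
           this.pluginConfig = readonlyConfig;
           // CosConf.buildWithReadOnlyConfig is an assumed helper, analogous
           // to S3HadoopConf.buildWithReadOnlyConfig in this PR.
           hadoopConf = CosConf.buildWithReadOnlyConfig(readonlyConfig);
       }
   }
   ```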
   
   **Rationale**: Must ensure backward compatibility until all Connectors 
migrate to Factory pattern.
   
   ---
   
   ## Issue 3: Test code not updated for the new API
   
   **Location**: 
   - `FileSinkConfigTest.java:49, 63, 79`
   - `CsvWriteStrategyTest.java:66, 129`
   - `ParquetWriteStrategyTest.java:82`
   - `OrcWriteStrategyTest.java:67`
   
   **Issue Description**:
   Test files are still using the old `ConfigFactory.parseMap()` API:
   ```java
   // Test code (unchanged)
   new FileSinkConfig(ConfigFactory.parseMap(writeConfig), writeRowType)
   ```
   
   But the `FileSinkConfig` constructor now expects a `ReadonlyConfig`. Java performs no implicit conversion between `Config` and `ReadonlyConfig`, so unless an overloaded constructor accepting `Config` was kept, these tests will no longer compile, let alone run.
   
   **Potential Risks**:
   - **Test failure**: CI/CD pipeline will report errors
   - **Regression risk**: Cannot verify the correctness of modifications
   
   **Impact Scope**:
   - **Direct impact**: Unit tests for File connector
   - **Impact surface**: Test suite for single Connector
   
   **Severity**: **MAJOR**
   
   **Improvement Suggestion**:
   Update all test files to use the new API:
   ```java
   // Old code
   new FileSinkConfig(ConfigFactory.parseMap(writeConfig), writeRowType)
   
   // New code
   new FileSinkConfig(ReadonlyConfig.fromMap(writeConfig), writeRowType)
   ```
   
   Or, if `ReadonlyConfig` offers a different adapter (such as `ReadonlyConfig.fromConfig`, used in Issue 2's Option 1 above), use that instead.
   
   **Rationale**: Must ensure tests can pass to verify the correctness of the 
PR.
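   
   A sketch of one updated test method (the option keys, values, and row type are placeholders; each test should reuse its existing fixtures):
   ```java
   // Assumes this test sits in the same package as FileSinkConfig.
   import static org.junit.jupiter.api.Assertions.assertNotNull;
   
   import java.util.HashMap;
   import java.util.Map;
   
   import org.apache.seatunnel.api.configuration.ReadonlyConfig;
   import org.apache.seatunnel.api.table.type.BasicType;
   import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
   import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
   import org.junit.jupiter.api.Test;
   
   public class FileSinkConfigTest {
   
       @Test
       public void shouldBuildFromReadonlyConfig() {
           // Placeholder options; reuse each test's existing writeConfig map.
           Map<String, Object> writeConfig = new HashMap<>();
           writeConfig.put("path", "/tmp/seatunnel/test");
           writeConfig.put("file_format_type", "text");
   
           SeaTunnelRowType writeRowType =
                   new SeaTunnelRowType(
                           new String[] {"id"},
                           new SeaTunnelDataType[] {BasicType.INT_TYPE});
   
           // ReadonlyConfig.fromMap replaces the old ConfigFactory.parseMap call.
           assertNotNull(new FileSinkConfig(ReadonlyConfig.fromMap(writeConfig), writeRowType));
       }
   }
   ```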
   
   ---
   
   ## Issue 4: S3RedshiftSink lacks configuration validation
   
   **Location**: `S3RedshiftSink.java:32-34` (same location as Issue 1, different concern)
   
   **Issue Description**:
   The old code had complete configuration validation in the `prepare()` method:
   ```java
   CheckResult checkResult = CheckConfigUtil.checkAllExists(
       pluginConfig,
       S3FileBaseOptions.S3_BUCKET.key(),
       S3RedshiftConfigOptions.JDBC_URL.key(),
       // ... other required fields
   );
   ```
   
   The new constructor performs no validation and relies entirely on the `S3RedshiftFactory.optionRule()` declarations. If a caller bypasses the factory and constructs the sink directly, or mutates configuration after construction, errors surface only at runtime, during the commit phase.
   
   **Potential Risks**:
   - **Delayed error reporting**: Configuration errors are exposed in the 
middle of job execution, wasting resources
   - **Poor user experience**: Error messages are unclear (NPE instead of 
"missing required config")
   
   **Impact Scope**:
   - **Direct impact**: Configuration error diagnosis for S3Redshift users
   - **Impact surface**: Single Connector
   
   **Severity**: **MAJOR**
   
   **Improvement Suggestion**:
   Add validation in the `S3RedshiftSink` constructor, or rely on 
`S3HadoopConf.buildWithReadOnlyConfig()` internal validation:
   ```java
   public S3RedshiftSink(ReadonlyConfig readonlyConfig) {
       this.pluginConfig = readonlyConfig;
       // S3HadoopConf.buildWithReadOnlyConfig should validate required fields
       hadoopConf = S3HadoopConf.buildWithReadOnlyConfig(readonlyConfig);
       
       // Or validate here. Caveat: if get(...) returns null or a default for
       // absent options instead of throwing, prefer getOptional(...) as in the
       // fail-fast sketch after this section.
       readonlyConfig.get(S3FileBaseOptions.S3_BUCKET);
       readonlyConfig.get(S3RedshiftConfigOptions.JDBC_URL);
       // ...
   }
   ```
   
   **Rationale**: Fail-fast principle; configuration errors should be exposed 
immediately at startup.
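   
   If explicit checks are preferred, a fail-fast sketch that names the missing key in the error (the exception type, message text, and the set of validated options are illustrative, not exhaustive):
   ```java
   public S3RedshiftSink(ReadonlyConfig readonlyConfig) {
       this.pluginConfig = readonlyConfig;
   
       // getOptional(...) makes the failure explicit and the message
       // actionable, rather than surfacing a bare NPE in the commit phase.
       readonlyConfig.getOptional(S3FileBaseOptions.S3_BUCKET)
               .orElseThrow(() -> new IllegalArgumentException(
                       "Missing required config: " + S3FileBaseOptions.S3_BUCKET.key()));
       readonlyConfig.getOptional(S3RedshiftConfigOptions.JDBC_URL)
               .orElseThrow(() -> new IllegalArgumentException(
                       "Missing required config: " + S3RedshiftConfigOptions.JDBC_URL.key()));
   
       hadoopConf = S3HadoopConf.buildWithReadOnlyConfig(readonlyConfig);
   }
   ```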
   
   ---
   
   ## Issue 5: pluginConfig in BaseFileSink.setTypeInfo() may be null
   
   **Location**: `BaseFileSink.java:77-79`
   
   ```java
   @Override
   public void setTypeInfo(SeaTunnelRowType seaTunnelRowType) {
       this.seaTunnelRowType = seaTunnelRowType;
       this.fileSinkConfig = new FileSinkConfig(pluginConfig, seaTunnelRowType);
   }
   ```
   
   **Issue Description**:
   `setTypeInfo()` is a method of the `SeaTunnelSink` interface and may be 
called before or after `prepare()`. In the new Factory pattern, 
`S3RedshiftSink` no longer has a `prepare()` method, so `pluginConfig` may 
still be `null` when `setTypeInfo()` is called.
   
   **Call Chain Analysis**:
   1. Factory creates Sink → `new S3RedshiftSink(readonlyConfig)`
   2. SeaTunnel engine calls `setTypeInfo()` → `new 
FileSinkConfig(pluginConfig, ...)` 
   3. If step 1 does not initialize `pluginConfig`, step 2 will pass in `null`
   
   **Potential Risks**:
   - **NPE**: `FileSinkConfig` constructor will fail immediately
   - **Uncertainty**: Depending on engine call order, may fail intermittently
   
   **Impact Scope**:
   - **Direct impact**: All File-based Sinks using Factory pattern
   - **Impact surface**: Multiple Connectors
   
   **Severity**: **CRITICAL**
   
   **Improvement Suggestion**:
   Ensure `pluginConfig` is initialized in the constructor (the fix from Issue 1). Until every sink is migrated, the defensive guard sketched below turns the bare NPE into an actionable error.
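   
   A defensive sketch over the `setTypeInfo()` snippet above (the guard and its message are illustrative; `Objects.requireNonNull` is from `java.util`):
   ```java
   @Override
   public void setTypeInfo(SeaTunnelRowType seaTunnelRowType) {
       this.seaTunnelRowType = seaTunnelRowType;
       // Fail with an actionable message instead of an anonymous NPE if the
       // constructor skipped pluginConfig initialization.
       java.util.Objects.requireNonNull(
               pluginConfig,
               "pluginConfig must be set in the sink constructor before setTypeInfo() is called");
       this.fileSinkConfig = new FileSinkConfig(pluginConfig, seaTunnelRowType);
   }
   ```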
   
   ---
   
   ## Issue 6: BaseFileSink.preCheckConfig() dereferences a possibly-null pluginConfig
   
   **Location**: `BaseFileSink.java:54-67`
   
   ```java
   public void preCheckConfig() {
       if (pluginConfig.getOptional(FileBaseSinkOptions.SINGLE_FILE_MODE).isPresent()
               && pluginConfig.get(FileBaseSinkOptions.SINGLE_FILE_MODE)
               && jobContext.isEnableCheckpoint()) {
           throw new IllegalArgumentException(...);
       }
       // ...
   }
   ```
   
   **Issue Description**:
   `preCheckConfig()` uses `pluginConfig` to read configuration, but in the new 
Factory pattern, if `pluginConfig` is not initialized in the constructor, this 
method will throw NPE.
   
   **Call Timing**: `setJobContext()` → `preCheckConfig()`
   
   **Potential Risks**:
   - **Validation failure**: Important configuration checks cannot execute
   - **NPE**: If `pluginConfig` is `null`
   
   **Impact Scope**:
   - **Direct impact**: All Connectors using `BaseFileSink`
   - **Impact surface**: Multiple Connectors
   
   **Severity**: **CRITICAL**
   
   **Improvement Suggestion**:
   Ensure `pluginConfig` is initialized during construction (again, the fix from Issue 1); the defensive guard sketched under Issue 5 applies here as well.
   
   ---
   
   ### IV. Final Assessment
   
   #### Overall Assessment Conclusion
   
   **❌ Not recommended to merge**
   
   Although this PR advances a worthwhile architectural improvement (migrating to the Factory pattern), it contains multiple **BLOCKER**- and **CRITICAL**-level issues:
   
   1. **S3RedshiftSink.pluginConfig not initialized** - causes runtime NPE
   2. **BaseFileSink.prepare() deleted but subclasses still call it** - causes 
compilation failure
   3. **Test code not updated** - CI will fail
   4. **Missing configuration validation** - degraded user experience
   
   These issues will result in:
   - S3Redshift Connector completely unusable
   - Possible compilation failures for other HDFS-based Connectors
   - All related tests failing
   
   #### Improvement Suggestion Priority
   
   **P0 (Must fix before merge)**:
   1. Add `this.pluginConfig = readonlyConfig;` in `S3RedshiftSink` constructor
   2. Fix `BaseFileSink` compatibility issue with `prepare()` method of 
subclasses
   3. Update all test files to use new API
   
   **P1 (Strongly recommended to fix)**:
   4. Add configuration validation in `S3RedshiftSink` constructor
   5. Add unit tests for `S3RedshiftFactory`
   6. Verify CI can pass
   
   **P2 (Follow-up improvements)**:
   7. Update `incompatible-changes.md` (if needed)
   8. Add migration guide documentation
   
   #### Architecture Suggestion
   
   This PR is part of Issue #8576, aiming to unify Connector creation methods. 
Suggested approach:
   
   1. **Phased migration**: Do not delete the old API all at once; instead:
      - Keep `prepare(Config)` but mark it `@Deprecated`
      - Require new Connectors to use the Factory pattern
      - Give existing Connectors a clear migration path and deadline
   
   2. **Complete testing**: Ensure all tests pass before the migration lands
   3. **Documentation first**: Provide a complete migration guide and examples
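   
   For reference, the target shape this migration drives toward, reconstructed from the snippets quoted above (interface and annotation names follow SeaTunnel's usual factory SPI, e.g. `TableSinkFactory` and `@AutoService(Factory.class)`; generics are elided and exact signatures should be treated as assumptions):
   ```java
   @AutoService(Factory.class)
   public class S3RedshiftFactory implements TableSinkFactory {
   
       @Override
       public String factoryIdentifier() {
           return "S3Redshift";
       }
   
       @Override
       public OptionRule optionRule() {
           // Declaring required options lets the engine reject a bad config
           // before any job resources are allocated (cf. Issue 4).
           return OptionRule.builder()
                   .required(S3FileBaseOptions.S3_BUCKET, S3RedshiftConfigOptions.JDBC_URL)
                   .build();
       }
   
       @Override
       public TableSink createSink(TableSinkFactoryContext context) {
           // Matches the caller snippet quoted in Issue 1.
           return () -> new S3RedshiftSink(context.getOptions());
       }
   }
   ```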

