yzeng1618 opened a new pull request, #10175:
URL: https://github.com/apache/seatunnel/pull/10175

   
   ### Purpose of this pull request
   
   This pull request fixes an incorrect type mapping in the Kudu → Doris 
pipeline when Doris tables are auto-created from Kudu catalogs.
   
   Previously, `connector-kudu` used Kudu’s internal `typeSize` as the logical `columnLength` for every column. For `Type.STRING`, `typeSize` is typically `16`. When this value was propagated to Doris, the Doris sink treated these columns as short fixed-length strings and created `CHAR(16)` columns instead of unbounded `STRING` columns. This breaks real-world Kudu tables, where `STRING` columns often contain values much longer than 16 characters.
   
   This PR:
   
   1. Updates `KuduCatalog` so that Kudu `STRING` columns no longer use 
`typeSize` as `columnLength`. Only non-`STRING` types keep using `typeSize`.
   2. Adds a unit test for `KuduCatalog` to ensure `STRING` columns are 
reported with `columnLength = null`.
   3. Adds an e2e assertion for Doris catalog to ensure that an upstream 
“unbounded string” column is created as Doris `STRING` (not `CHAR(16)`), and 
`sourceType` is `string`.
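
The length rule from item 1 can be sketched as follows. This is a minimal illustration, not the actual `KuduCatalog` code: `KuduType` is a hypothetical stand-in enum (not Kudu's own `Type` class), and the sizes are the ones described above.

```java
// Sketch only: KuduType is a hypothetical stand-in, not the Kudu client's Type class.
class KuduColumnLengthSketch {

    enum KuduType {
        INT32(4),
        INT64(8),
        STRING(16); // the internal size this PR reports for Type.STRING

        final int typeSize;

        KuduType(int typeSize) {
            this.typeSize = typeSize;
        }
    }

    /** Returns the logical column length, or null when the type has no meaningful length. */
    static Long columnLength(KuduType type) {
        if (type == KuduType.STRING) {
            // typeSize describes Kudu's internal representation, not a character limit.
            return null;
        }
        return (long) type.typeSize;
    }

    public static void main(String[] args) {
        for (KuduType type : KuduType.values()) {
            System.out.println(type + " -> " + columnLength(type));
        }
    }
}
```

Returning `null` rather than a sentinel such as `0` leaves it to the downstream sink to choose its own unbounded string type.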
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   **Previous behavior**
   
   - When Kudu was the source and Doris the sink with schema auto-creation (`schema_save_mode = "RECREATE_SCHEMA"` or `"CREATE_SCHEMA_WHEN_NOT_EXIST"`), Kudu `STRING` columns were created in Doris as `CHAR(16)`.
   - This could lead to:
     - Truncation or write failures for values longer than 16 characters.
     - Mismatched schema between Kudu and Doris: developers expect `STRING` or 
large `VARCHAR`, but get fixed-length `CHAR(16)`.
   
   **New behavior**
   
   - Kudu `STRING` columns are now exposed from `KuduCatalog` with no logical 
length (`columnLength = null`).
   - The Doris sink maps these columns to the Doris `STRING` type (internally using Doris’ `MAX_STRING_LENGTH`), not `CHAR(16)`.
   - Existing Doris tables are not modified by this PR; the change only affects 
how new tables are auto-created from Kudu catalogs.
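
To see why a `null` length changes the outcome, here is a simplified sketch of a length-based choice among Doris string types. It is an assumption-laden illustration, not the Doris sink's real converter: the method name `dorisType` is hypothetical, and the thresholds are Doris' documented `CHAR` and `VARCHAR` limits, which may not be the cut-offs the sink actually uses.

```java
// Sketch only: a simplified, hypothetical length-to-type rule, not SeaTunnel's converter.
class DorisStringTypeSketch {

    // Documented Doris maximum lengths for CHAR and VARCHAR (assumed as thresholds here).
    static final long MAX_CHAR_LENGTH = 255;
    static final long MAX_VARCHAR_LENGTH = 65533;

    /** Picks a Doris string type for an upstream string column of the given length. */
    static String dorisType(Long columnLength) {
        if (columnLength == null || columnLength <= 0) {
            // No logical length upstream: use the unbounded type.
            return "STRING";
        }
        if (columnLength <= MAX_CHAR_LENGTH) {
            return "CHAR(" + columnLength + ")";
        }
        if (columnLength <= MAX_VARCHAR_LENGTH) {
            return "VARCHAR(" + columnLength + ")";
        }
        return "STRING";
    }

    public static void main(String[] args) {
        System.out.println(dorisType(16L));  // old behavior for Kudu STRING columns
        System.out.println(dorisType(null)); // new behavior for Kudu STRING columns
    }
}
```

Under this rule a length of `16` lands in the `CHAR` branch, which matches the previous behavior, while `null` goes straight to `STRING`.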
   
   
   ### How was this patch tested?
   
   1. **Unit test**
   
      - Added `KuduCatalogTest` in `connector-kudu`:
   
        - Mocks `KuduClient` and `KuduTable` with:
          - One `INT32` column (`id`)
          - One `STRING` column (`val_string`)
        - Calls `KuduCatalog.getTable` and verifies:
          - Non-`STRING` column `id` keeps a non-null `columnLength`.
          - `STRING` column `val_string` has `columnLength == null`.
   
   2. **Doris e2e test**
   
      - Extended `DorisCatalogIT` in `connector-doris-e2e` with 
`testCreateTableWithUnboundedStringColumn`:
        - Builds an upstream `CatalogTable` with:
          - `k1` as `INT` primary key.
          - `k2` as `STRING` with `columnLength = null` (simulating 
KuduCatalog’s behavior).
        - Uses Doris sink `schema_save_mode` to auto-create 
`test.unbounded_string`.
        - Reads the table via `DorisCatalog` and asserts that:
          - Column name is `k2`.
          - Logical type is `BasicType.STRING_TYPE`.
          - `sourceType` is `string` (case-insensitive).
          - If `columnLength` is present, it is greater than 16 (preventing 
regression to `CHAR(16)`).
   
   3. **Existing e2e**
   
      - Ran the existing Kudu and Doris connector e2e suites locally to check for regressions in other scenarios.
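
The regression guard used in the Doris e2e test (item 2) can be sketched as a standalone check. `ColumnInfo` and `looksUnbounded` are hypothetical names for illustration; the real test asserts against the column returned by `DorisCatalog`.

```java
// Sketch only: ColumnInfo is a hypothetical, simplified view of a catalog column.
class UnboundedStringGuardSketch {

    static final class ColumnInfo {
        final String name;
        final String sourceType;
        final Long columnLength; // may be null when the catalog reports no length

        ColumnInfo(String name, String sourceType, Long columnLength) {
            this.name = name;
            this.sourceType = sourceType;
            this.columnLength = columnLength;
        }
    }

    /** True when the column reads back as an unbounded string rather than CHAR(16). */
    static boolean looksUnbounded(ColumnInfo column) {
        boolean typeOk = "string".equalsIgnoreCase(column.sourceType);
        // A missing length is fine; a present one must exceed the old value of 16.
        boolean lengthOk = column.columnLength == null || column.columnLength > 16;
        return typeOk && lengthOk;
    }

    public static void main(String[] args) {
        System.out.println(looksUnbounded(new ColumnInfo("k2", "STRING", null)));
        System.out.println(looksUnbounded(new ColumnInfo("k2", "CHAR", 16L)));
    }
}
```

Checking "greater than 16 if present" instead of an exact value keeps the test independent of whatever maximum length Doris reports for `STRING`.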
   
   
   ### Check list
   
   * [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md).
   * [ ] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If necessary, please update `incompatible-changes.md` to describe the 
incompatibility caused by this PR.
   * [x] This PR only touches existing connectors (Kudu, Doris) and does 
**not** add new connector jars:
     * No changes needed for `plugin-mapping.properties`.
     * No changes needed for `seatunnel-dist/pom.xml`.
     * No new CI label required in 
`.github/workflows/labeler/label-scope-conf.yml`.
     * E2E tests have been added/extended under 
`seatunnel-e2e/seatunnel-connector-v2-e2e/`.
   

