Ruiii-w commented on issue #10406:
URL: https://github.com/apache/seatunnel/issues/10406#issuecomment-3814775260
> Thanks for your interest in contributing! This feature request is valid.
Currently, `COPY` is only supported on the **Sink** side
(`JdbcSinkOptions.use_copy_statement`, `CopyManagerProxy.copyInMethod`), while
the **Source** side exclusively uses standard SELECT queries
(`ChunkSplitter.java`, `JdbcInputFormat.java`).
>
> The PostgreSQL dialect already optimizes with cursor mode
(`PostgresDialect.creatPreparedStatement`), but `COPY ... TO STDOUT` would
provide better performance for large exports.
>
> **Key implementation points to consider:**
>
> * Extend `CopyManagerProxy` to support `copyOut` (returns `InputStream`)
> * Add a new Source option (e.g., `use_copy_statement` for Source)
> * Handle `COPY` limitations: no support for custom `query` (only
`table_path`), complex WHERE clauses, or interactions with `partition_column`
>
> **Questions to clarify:**
>
> 1. Should `COPY` fall back to SELECT when custom `query` is configured?
> 2. Should `COPY` be compatible with parallel reads via `partition_column`,
or mutually exclusive?
To ensure compatibility with the existing JDBC configuration and make `COPY`
compatible with the current JDBC sharding mechanism, we can try incorporating
`query`, `partition_column`, and `where_condition` into the SQL construction of
`COPY` for more flexible applications.
Additionally, the choice between using `COPY` and `SELECT` can be directly
controlled by the `use_copy_statement` configuration for the Source.
### Scenario 1: Both `partition_column` and `where_condition` are configured
When `partition_column` and `where_condition` are configured, the generated
SQL will be as follows:
```sql
COPY (
SELECT * FROM (
SELECT * FROM "schema"."table_name"
WHERE "partition_column" >= <start_value>
AND "partition_column" < <end_value>
) tmp <where_condition>
) TO STDOUT WITH [BINARY|CSV];
```
### Scenario 2: Only `query` is enabled
When only the custom `query` is enabled, the generated SQL will be
simplified to:
```sql
COPY (
SELECT * FROM (
<query>
) tmp
) TO STDOUT WITH [BINARY|CSV];
```
### Performance Verification
Through practical experiments and measurements, i have confirmed that this
form of `COPY`-based SQL performs **better than standard `SELECT` queries**.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]