davidzollo opened a new issue, #10341:
URL: https://github.com/apache/seatunnel/issues/10341
## Description
SeaTunnel currently provides File sink connectors for multiple object stores
(e.g., `S3File`, `OssFile`, `ObsFile`, `CosFile`, `HdfsFile`), but there is no
dedicated sink connector for **Azure Blob Storage**.
Many users run data integration workloads on Azure and need to land files
(CSV/Parquet/ORC/JSON/Binary, etc.) into Azure Blob Storage (and optionally
ADLS Gen2) with the same usability and guarantees as existing File sinks.
I’d like to request a new sink connector to write SeaTunnel output to Azure
Blob Storage.
## Usage Scenario
- Persist batch/stream outputs to Azure Blob Storage for downstream
analytics (Synapse, Databricks, Spark, etc.).
- Store partitioned datasets (e.g., `dt=YYYY-MM-DD/`) in Parquet/ORC/CSV.
- Exactly-once semantics matching the existing File sinks (2PC + temp
  directory + commit/rename).
## Proposed Scope
- Add a new SeaTunnel v2 sink connector: `AzureBlobFile` (under the
`connector-file` family).
- Support Azure endpoints/schemes (path shapes for each are sketched after
  this list):
  - Azure Blob (WASB): `wasb://` / `wasbs://`
  - (Optional) ADLS Gen2 (ABFS): `abfs://` / `abfss://`
- Support common File sink capabilities consistent with existing connectors:
- File formats: `text`, `csv`, `parquet`, `orc`, `json`, `excel`, `xml`,
`binary` (following what other File sinks support)
- Partitioned writes
- Exactly-once via existing 2PC/commit behavior
- Authentication (at least):
- Account key
- SAS token
- (Nice-to-have) AAD OAuth / Managed Identity (depending on feasibility
and community preference)
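For reference, the path shapes implied by the two scheme families differ only in the host suffix (Blob endpoints use `blob.core.windows.net`, ADLS Gen2 uses `dfs.core.windows.net`). The fragment below is just a sketch of the `path` values the connector would accept; `path` is the option name proposed in the next section:

```hocon
# Illustrative path values per scheme (placeholders, not real accounts):
path = "wasb://<container>@<account>.blob.core.windows.net/<dir>"    # Blob, plain
path = "wasbs://<container>@<account>.blob.core.windows.net/<dir>"   # Blob over TLS
path = "abfs://<container>@<account>.dfs.core.windows.net/<dir>"     # ADLS Gen2, plain
path = "abfss://<container>@<account>.dfs.core.windows.net/<dir>"    # ADLS Gen2 over TLS
```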
## Configuration Proposal
Keep options consistent with other File sinks, plus Azure-specific configs.
For advanced scenarios, allow passing Hadoop Azure FS properties as a map
(similar to `S3File`’s `hadoop_s3_properties`), e.g. `hadoop_azure_properties`.
Example (illustrative):
```hocon
sink {
  AzureBlobFile {
    path = "wasbs://<container>@<account>.blob.core.windows.net/<dir>"
    tmp_path = "wasbs://<container>@<account>.blob.core.windows.net/<tmp_dir>"

    # common file sink options
    file_format_type = "parquet"
    have_partition = true
    partition_by = ["dt"]

    # Azure auth (one of)
    account_name = "<account>"
    account_key = "<account_key>"
    # or: sas_token = "<sas_token>"

    # pass-through for Hadoop Azure FS properties (optional)
    hadoop_azure_properties = {
      # examples:
      # "fs.azure.account.key.<account>.blob.core.windows.net" = "<account_key>"
      # "fs.azure.sas.<container>.<account>.blob.core.windows.net" = "<sas>"
    }
  }
}
```
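If the optional ABFS support is accepted, the AAD OAuth nice-to-have could be expressed entirely through the proposed `hadoop_azure_properties` pass-through rather than new top-level options. A sketch follows: the property keys are the standard `hadoop-azure` ABFS client-credentials settings, while the surrounding option names (`AzureBlobFile`, `hadoop_azure_properties`) are the ones proposed above and do not exist yet:

```hocon
sink {
  AzureBlobFile {
    path = "abfss://<container>@<account>.dfs.core.windows.net/<dir>"
    file_format_type = "parquet"

    # Standard hadoop-azure (ABFS) OAuth client-credentials properties,
    # passed through to the Hadoop FileSystem unchanged.
    hadoop_azure_properties = {
      "fs.azure.account.auth.type.<account>.dfs.core.windows.net" = "OAuth"
      "fs.azure.account.oauth.provider.type.<account>.dfs.core.windows.net" = "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
      "fs.azure.account.oauth2.client.id.<account>.dfs.core.windows.net" = "<client_id>"
      "fs.azure.account.oauth2.client.secret.<account>.dfs.core.windows.net" = "<client_secret>"
      "fs.azure.account.oauth2.client.endpoint.<account>.dfs.core.windows.net" = "https://login.microsoftonline.com/<tenant_id>/oauth2/token"
    }
  }
}
```

Keeping OAuth in the pass-through map would avoid committing to new option names before the baseline (account key / SAS) ships.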
## Dependencies / Packaging
- As with `S3File`, document the Hadoop/Azure jars required on Spark/Flink
clusters (e.g., `hadoop-azure` plus its Azure storage dependencies), and
ensure the SeaTunnel Engine distribution either bundles what is needed or
documents what to add under `${SEATUNNEL_HOME}/lib`.
## Acceptance Criteria
- `AzureBlobFile` sink connector is available and documented (new doc page
under `docs/en/connector-v2/sink/`).
- Can write files to Azure Blob Storage using account key (baseline) with
the same semantics as other File sinks.
- Includes at least one integration test (recommended: use Azurite)
validating end-to-end file writes; an illustrative Azurite config is
sketched after this list.
- Works on SeaTunnel Zeta, and documents Spark/Flink requirements.
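On the Azurite point, a rough sketch of what a test job config might look like, using Azurite's published well-known dev account and key (public constants, safe to commit). `fs.azure.storage.emulator.account.name` is the legacy WASB driver's emulator hook; whether it is enough to point WASB at Azurite, or whether endpoint overrides are needed, would have to be verified during implementation:

```hocon
sink {
  AzureBlobFile {
    # Azurite's well-known development account
    path = "wasb://test-container@devstoreaccount1/output"
    file_format_type = "json"
    account_name = "devstoreaccount1"
    account_key  = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="

    hadoop_azure_properties = {
      # Route this account to the local emulator instead of *.blob.core.windows.net
      "fs.azure.storage.emulator.account.name" = "devstoreaccount1"
    }
  }
}
```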
## Related issues
- N/A (not found yet).