davidzollo opened a new issue, #10356:
URL: https://github.com/apache/seatunnel/issues/10356
## Background
Salesforce is the world's leading Customer Relationship Management (CRM)
platform with over 20% market share globally. It serves as the single source of
truth for customer data, sales opportunities, service cases, and marketing
campaigns across millions of enterprises.
Currently, SeaTunnel lacks native support for Salesforce as a data source,
preventing users from building data pipelines that integrate CRM data with
their data warehouses and analytics platforms.
## Motivation
- **Market Leader**: Salesforce dominates the enterprise CRM space with the
largest user base globally
- **API-Only Access**: Salesforce exposes its data only through its own APIs
(REST, SOAP, Bulk, Streaming); there is no native JDBC endpoint
- **Critical Business Data**: Organizations need to sync CRM data (accounts,
contacts, opportunities, cases, etc.) to data warehouses for analytics
- **Real-Time Integration**: Support for both batch extraction and change
data capture (CDC) via streaming APIs
## Proposed Solution
Implement a dedicated Salesforce Source connector built on the Salesforce REST
API and Bulk API 2.0:
### Core Features
1. **Multiple API Support**
   - REST API for real-time queries and small datasets
   - Bulk API 2.0 for large-scale data extraction (millions of records)
   - Streaming API for real-time change data capture (CDC)
   - Support for SOQL (Salesforce Object Query Language)
2. **Object Support**
   - Standard objects (Account, Contact, Lead, Opportunity, Case, etc.)
   - Custom objects
   - Metadata discovery and schema inference
   - Relationship traversal (parent-child, lookup, master-detail)
3. **Data Extraction Modes**
   - **Full Snapshot**: Extract complete object data
   - **Incremental**: Extract records modified after a specific timestamp
   - **CDC**: Real-time streaming of change events via PushTopic or Change Data Capture
4. **Authentication**
   - OAuth 2.0 (Authorization Code, JWT Bearer, Client Credentials)
   - Username-Password flow (for development/testing)
   - Connected App integration
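
To make the relationship-traversal item concrete, here is a minimal Python sketch of how a record returned by a REST API relationship query could be flattened, roughly what the proposed `flatten_relationships` option might do. The function name, the dotted column naming, and the sample record are assumptions for illustration only; the connector itself would presumably be written in Java and may choose a different naming scheme.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten one REST API query record into a single-level dict.

    Hypothetical helper: the dotted naming (e.g. "Account.Name") is an assumption.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if key == "attributes":
            # Per-record metadata (type, url) added by the REST API; not a field.
            continue
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict) and "records" not in value:
            # Nested object such as a parent (lookup / master-detail) record: recurse.
            flat.update(flatten_record(value, name, sep))
        else:
            # Scalars, nulls, and child subquery results are kept as-is.
            flat[name] = value
    return flat


# Illustrative record shaped like the result of:
#   SELECT Id, LastName, Account.Name, Account.Owner FROM Contact
record = {
    "attributes": {"type": "Contact", "url": "/services/data/v59.0/sobjects/Contact/003xx"},
    "Id": "003xx",
    "LastName": "Doe",
    "Account": {
        "attributes": {"type": "Account", "url": "/services/data/v59.0/sobjects/Account/001xx"},
        "Name": "Acme",
        "Owner": None,
    },
}
print(flatten_record(record))
```

Child subquery results (objects carrying a `records` array) are left untouched in this sketch, since expanding them changes the row count and needs a separate design decision.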
### Configuration Example
```hocon
source {
  Salesforce {
    # Authentication
    auth_type = "oauth2_jwt"
    client_id = "your_connected_app_client_id"
    client_secret = "your_client_secret"
    username = "[email protected]"
    private_key_file = "/path/to/private-key.pem"

    # Instance configuration
    instance_url = "https://yourinstance.salesforce.com"
    api_version = "v59.0"

    # Data extraction
    object_name = "Account"
    extraction_mode = "incremental" # or "full", "cdc"

    # Query configuration
    soql_query = "SELECT Id, Name, Industry, AnnualRevenue FROM Account WHERE CreatedDate > LAST_N_DAYS:30"
    # or use simple fields selection
    fields = ["Id", "Name", "Industry", "AnnualRevenue"]
    filter = "CreatedDate > LAST_N_DAYS:30"

    # Incremental configuration
    incremental_field = "LastModifiedDate"
    start_date = "2024-01-01T00:00:00Z"

    # Performance tuning
    batch_size = 2000
    max_retries = 3
    request_timeout_ms = 60000

    # Schema options
    include_deleted = false
    flatten_relationships = true
  }
}
```
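
As an illustration of how the `fields`, `filter`, `incremental_field`, and `start_date` options above might combine into one incremental query, here is a small Python sketch. The function name and the choice to append `ORDER BY` are assumptions, not part of the proposal. It relies on the fact that SOQL datetime literals are written unquoted.

```python
from typing import List, Optional


def build_incremental_soql(
    object_name: str,
    fields: List[str],
    incremental_field: str,
    start_date: str,
    extra_filter: Optional[str] = None,
) -> str:
    """Build a SOQL query for records changed after start_date.

    Hypothetical helper; it does no escaping or validation of its inputs.
    """
    # SOQL datetime literals (e.g. 2024-01-01T00:00:00Z) are not quoted.
    conditions = [f"{incremental_field} > {start_date}"]
    if extra_filter:
        # Parenthesize the user filter so any OR inside it cannot leak out.
        conditions.append(f"({extra_filter})")
    return (
        f"SELECT {', '.join(fields)} FROM {object_name} "
        f"WHERE {' AND '.join(conditions)} "
        f"ORDER BY {incremental_field} ASC"
    )


print(
    build_incremental_soql(
        object_name="Account",
        fields=["Id", "Name", "Industry", "AnnualRevenue"],
        incremental_field="LastModifiedDate",
        start_date="2024-01-01T00:00:00Z",
        extra_filter="CreatedDate > LAST_N_DAYS:30",
    )
)
```

Ordering by the incremental field lets the reader checkpoint the last value seen and resume from it. Whether the comparison should be `>` or `>=` (to avoid skipping records that share a timestamp) is a design question left open here.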
### CDC Configuration Example
```hocon
source {
  Salesforce {
    auth_type = "oauth2_jwt"
    # ... authentication config ...

    extraction_mode = "cdc"
    object_name = "Opportunity"

    # CDC options
    cdc_type = "change_data_capture" # or "push_topic"
    replay_id = -1 # -1 for new events, -2 for all retained events

    # For PushTopic
    push_topic_name = "/topic/OpportunityUpdates"
  }
}
```
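
The sketch below shows one way the CDC options above could be resolved into a Streaming API channel and an initial replay position. The function names and the checkpoint-over-config precedence are assumptions. The channel patterns follow Salesforce's conventions: `/data/<Object>ChangeEvent` for standard objects, `/data/<Name>__ChangeEvent` for custom objects, and `/topic/<Name>` for PushTopics.

```python
from typing import Optional

REPLAY_NEW_EVENTS = -1    # only events published after subscribing
REPLAY_ALL_RETAINED = -2  # all events still inside the retention window


def resolve_channel(
    cdc_type: str,
    object_name: Optional[str] = None,
    push_topic_name: Optional[str] = None,
) -> str:
    """Map the proposed cdc_type setting to a Streaming API channel (hypothetical helper)."""
    if cdc_type == "push_topic":
        if not push_topic_name:
            raise ValueError("push_topic_name is required when cdc_type is 'push_topic'")
        # Accept either a bare topic name or a full "/topic/..." channel.
        if push_topic_name.startswith("/topic/"):
            return push_topic_name
        return f"/topic/{push_topic_name}"
    if cdc_type == "change_data_capture":
        if not object_name:
            raise ValueError("object_name is required when cdc_type is 'change_data_capture'")
        if object_name.endswith("__c"):
            # Custom object: Invoice__c -> /data/Invoice__ChangeEvent
            return f"/data/{object_name[:-3]}__ChangeEvent"
        # Standard object: Opportunity -> /data/OpportunityChangeEvent
        return f"/data/{object_name}ChangeEvent"
    raise ValueError(f"unsupported cdc_type: {cdc_type!r}")


def initial_replay_id(checkpointed: Optional[int], configured: int = REPLAY_NEW_EVENTS) -> int:
    """Prefer a replay ID restored from a checkpoint over the configured default."""
    return checkpointed if checkpointed is not None else configured


print(resolve_channel("change_data_capture", object_name="Opportunity"))
print(resolve_channel("push_topic", push_topic_name="/topic/OpportunityUpdates"))
print(initial_replay_id(None), initial_replay_id(1234))
```

Restoring the replay ID from a checkpoint is what would let a restarted job resume where it stopped, provided the stored ID is still inside Salesforce's event retention window.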
## Expected Benefits
1. **Enterprise Integration**: Enable thousands of Salesforce customers to
use SeaTunnel for data integration
2. **Complete Data Access**: Support all Salesforce objects and relationship
types
3. **High Performance**: Bulk API 2.0 can extract millions of records
efficiently
4. **Real-Time Capabilities**: CDC support enables near-real-time data
synchronization
5. **Ecosystem Growth**: Position SeaTunnel as a viable alternative to
commercial ETL tools such as Fivetran and Airbyte Cloud
## Technical Considerations
- **Dependencies**:
  - Salesforce REST API client library or custom HTTP client
  - OAuth 2.0 library for authentication
  - Jackson/Gson for JSON parsing
- **Rate Limiting**:
  - Implement exponential backoff for API limits
  - Support for concurrent API call tracking
  - Configurable request throttling
- **Error Handling**:
  - Handle API errors (INVALID_SESSION_ID, REQUEST_LIMIT_EXCEEDED, etc.)
  - Retry logic with configurable strategies
  - Failed record tracking and logging
- **Testing**:
  - Salesforce Developer Edition org for integration tests
  - Mock API server for unit tests
  - Support for Salesforce scratch orgs in CI/CD
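
The rate-limiting and retry items above could be sketched as follows in Python: capped exponential backoff with full jitter, driven by a `max_retries` value like the one in the configuration example. `RetryableApiError`, the 500 ms base delay, and the injectable `sleep`/`rng` parameters are assumptions made so the example is self-contained and testable. This is not an existing SeaTunnel or Salesforce API.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RetryableApiError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. a request-limit response."""


def backoff_delays(max_retries: int, base_ms: int = 500, cap_ms: int = 60000) -> List[int]:
    """Upper bound of the delay before each retry: base * 2^attempt, capped."""
    return [min(cap_ms, base_ms * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 3,
    base_ms: int = 500,
    cap_ms: int = 60000,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run request(), retrying on RetryableApiError with jittered exponential backoff."""
    delays = backoff_delays(max_retries, base_ms, cap_ms)
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RetryableApiError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: sleep a random fraction of the current upper bound.
            sleep(rng() * delays[attempt] / 1000.0)
    raise AssertionError("unreachable")


print(backoff_delays(3))  # [500, 1000, 2000]
```

Jitter spreads retries from parallel readers so they do not hit the API limit again in lockstep. Non-retryable errors (for example, a malformed query) propagate immediately because only `RetryableApiError` is caught.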
## Implementation Phases
### Phase 1: Basic Support (MVP)
- OAuth 2.0 authentication
- REST API for full snapshot extraction
- Standard object support with SOQL queries
- Basic schema inference
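
For the Phase 1 full-snapshot path, the REST API returns query results in pages: each response carries `records`, a `done` flag, and, when more rows remain, a `nextRecordsUrl` to follow. The Python sketch below walks those pages. The `get_json` callable is a hypothetical stand-in for an authenticated HTTP GET against the instance URL, and the fake pages exist only to make the example runnable.

```python
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode


def iter_query_records(
    get_json: Callable[[str], Dict[str, Any]],
    api_version: str,
    soql: str,
) -> Iterator[Dict[str, Any]]:
    """Yield every record of a SOQL query, following nextRecordsUrl until done."""
    path = f"/services/data/{api_version}/query?" + urlencode({"q": soql})
    while path:
        page = get_json(path)
        yield from page.get("records", [])
        # When done is false, the response includes the path of the next page.
        path = None if page.get("done", True) else page.get("nextRecordsUrl")


# Fake two-page response used in place of real HTTP calls (illustrative data only).
first_path = "/services/data/v59.0/query?q=SELECT+Id+FROM+Account"
fake_pages = {
    first_path: {
        "totalSize": 3,
        "done": False,
        "nextRecordsUrl": "/services/data/v59.0/query/01gxx-2000",
        "records": [{"Id": "001A"}, {"Id": "001B"}],
    },
    "/services/data/v59.0/query/01gxx-2000": {
        "totalSize": 3,
        "done": True,
        "records": [{"Id": "001C"}],
    },
}

ids = [r["Id"] for r in iter_query_records(fake_pages.__getitem__, "v59.0", "SELECT Id FROM Account")]
print(ids)  # ['001A', '001B', '001C']
```

Because the function is a generator, rows can be emitted downstream as each page arrives instead of buffering the whole result set.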
### Phase 2: Enterprise Features
- Bulk API 2.0 for large-scale extraction
- Incremental extraction by modified date
- Custom object support
- Advanced field mapping and transformations
### Phase 3: Real-Time CDC
- Streaming API integration
- Change Data Capture events
- PushTopic support
- Exactly-once semantics
## References
- [Salesforce REST API Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/)
- [Bulk API 2.0 Documentation](https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/bulk_api_2_0.htm)
- [Change Data Capture Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.change_data_capture.meta/change_data_capture/)
- [SOQL and SOSL Reference](https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/)
## Community Impact
This connector will:
- Make SeaTunnel competitive with commercial ETL tools in the CRM
integration space
- Enable data-driven decision making for sales, marketing, and customer
service teams
- Attract enterprise users who need reliable Salesforce integration
---
**Priority**: High
**Estimated Effort**: Medium-High
**Target Release**: 2.3.14 or 3.0.0