davidzollo opened a new issue, #10358:
URL: https://github.com/apache/seatunnel/issues/10358
## Background
HubSpot is a leading marketing automation and CRM platform used by over
200,000 customers worldwide, particularly popular among small to mid-sized
businesses. It provides comprehensive tools for marketing, sales, customer
service, and content management.
Currently, SeaTunnel lacks native support for HubSpot as a data source,
preventing users from integrating CRM and marketing data with their data
warehouses and analytics platforms.
## Motivation
- **SMB Market Leader**: HubSpot is the dominant choice for small and
medium-sized businesses globally
- **Marketing Automation**: Critical source for marketing campaign data,
lead tracking, and conversion analytics
- **API-Only Access**: HubSpot exposes its data only through a REST API;
there is no JDBC or SQL interface
- **Data-Driven Marketing**: Organizations need to analyze marketing
performance, customer journeys, and ROI
## Proposed Solution
Implement a dedicated HubSpot Source connector using HubSpot REST API v3:
### Core Features
1. **CRM Objects Support**
   - **Standard Objects**: Contacts, Companies, Deals, Tickets, Products,
     Line Items
   - **Custom Objects**: User-defined objects created in HubSpot
   - **Activities**: Emails, Calls, Meetings, Tasks, Notes
   - **Engagement Data**: Email opens, clicks, form submissions, page views
2. **Marketing Data**
   - **Campaigns**: Email campaigns, ad campaigns, social media campaigns
   - **Forms**: Form submissions and field values
   - **Landing Pages**: Page analytics and conversion data
   - **Lists**: Contact lists and segmentation
   - **Workflows**: Automation workflow execution data
3. **Data Extraction Modes**
   - **Full Snapshot**: Complete object/entity extraction
   - **Incremental**: Based on the `lastmodifieddate` or `createdate`
     property
   - **Association-Based**: Extract related objects (e.g., Contacts with
     their Deals)
4. **Authentication**
   - **Private App Access Token**: Recommended for server-to-server
     integration
   - **OAuth 2.0**: For user-context integrations
   - **API Key** (Legacy): Support for existing integrations. HubSpot
     sunset API keys in November 2022, so this mode may no longer be
     usable against current accounts
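
The authentication modes differ mainly in how the credential is attached to each request. The following is a minimal sketch, not the connector's actual code: `build_auth` is a hypothetical helper, and the `auth_type` values are assumed to match the configuration example in this issue. Private app tokens and OAuth access tokens are both sent as `Bearer` tokens, while the legacy API key was passed as the `hapikey` query parameter.

```python
from typing import Dict, Tuple


def build_auth(auth_type: str, credential: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) to attach to a single request.

    Hypothetical helper; the auth_type names mirror the proposed config.
    """
    if auth_type in ("private_app", "oauth2"):
        # Both token kinds travel in the Authorization header.
        return {"Authorization": f"Bearer {credential}"}, {}
    if auth_type == "api_key":
        # Legacy mode: the key was sent as a query parameter.
        return {}, {"hapikey": credential}
    raise ValueError(f"unsupported auth_type: {auth_type!r}")
```

Rejecting unknown values up front keeps a typo in `auth_type` from turning into a confusing 401 later in the job.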
### Configuration Example
```hocon
source {
  HubSpot {
    # Authentication
    auth_type = "private_app" # or "oauth2", "api_key"
    access_token = "pat-na1-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    # Object configuration
    # One of "contacts", "companies", "deals", "tickets", "custom_objects"
    object_type = "contacts"
    # For custom objects
    custom_object_name = "my_custom_object"
    # Extraction mode
    extraction_mode = "incremental" # or "full"
    # Properties to fetch
    properties = [
      "firstname",
      "lastname",
      "email",
      "company",
      "lifecyclestage",
      "createdate",
      "lastmodifieddate"
    ]
    # or fetch all properties
    fetch_all_properties = true
    # Incremental configuration
    incremental_field = "lastmodifieddate"
    start_date = "2024-01-01T00:00:00Z"
    # Associations (relationships)
    include_associations = true
    association_types = ["contacts_to_companies", "contacts_to_deals"]
    # Performance tuning
    batch_size = 100
    max_concurrent_requests = 5
    rate_limit_per_second = 10 # standard tier: 100 requests per 10 seconds
    request_timeout_ms = 30000
    # Filtering
    filter_groups = [
      {
        filters = [
          {
            property_name = "lifecyclestage"
            operator = "EQ"
            value = "customer"
          },
          {
            property_name = "createdate"
            operator = "GT"
            value = "2024-01-01"
          }
        ]
      }
    ]
  }
}
```
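
As an illustration of how the `filter_groups` option could reach the API, here is a hedged sketch that maps the snake_case config onto the `filterGroups` body of the CRM Search API. `to_search_body` is a hypothetical name, and the config is assumed to have been parsed into plain dicts and lists already. In the Search API, filters inside one group are combined with AND and separate groups with OR.

```python
from typing import Any, Dict, List, Optional


def to_search_body(
    filter_groups: List[Dict[str, Any]],
    properties: List[str],
    limit: int = 100,
    after: Optional[str] = None,
) -> Dict[str, Any]:
    """Translate parsed connector config into a CRM search request body."""
    body: Dict[str, Any] = {
        "filterGroups": [
            {
                "filters": [
                    {
                        "propertyName": f["property_name"],
                        "operator": f["operator"],
                        "value": f["value"],
                    }
                    for f in group["filters"]
                ]
            }
            for group in filter_groups
        ],
        "properties": list(properties),
        "limit": limit,
    }
    if after is not None:
        # Cursor from the previous page; omitted on the first request.
        body["after"] = after
    return body
```

The resulting dict would be serialized as JSON and sent with `POST /crm/v3/objects/{objectType}/search`.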
### Marketing Data Example
```hocon
source {
  HubSpot {
    auth_type = "private_app"
    access_token = "pat-na1-xxxxxxxx"
    # Extract email campaign data
    object_type = "marketing_emails"
    properties = [
      "id",
      "name",
      "subject",
      "campaign_name",
      "created",
      "updated",
      "send_time"
    ]
    # Include campaign statistics
    include_statistics = true # clicks, opens, bounces, etc.
    extraction_mode = "incremental"
    incremental_field = "updated"
    start_date = "2024-01-01"
  }
}
```
### Custom Object Example
```hocon
source {
  HubSpot {
    auth_type = "private_app"
    access_token = "pat-na1-xxxxxxxx"
    object_type = "custom_objects"
    custom_object_name = "2-12345678" # Custom object schema ID
    fetch_all_properties = true
    # Include associations with standard objects
    include_associations = true
    association_types = [
      "custom_to_contacts",
      "custom_to_companies"
    ]
  }
}
```
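
One way the `object_type` and `custom_object_name` options could resolve to a CRM v3 endpoint is sketched below. `object_path` and `STANDARD_OBJECTS` are hypothetical names. The sketch assumes custom objects are addressed by their object type ID (such as `2-12345678`) in the same `/crm/v3/objects/{objectType}` path used for standard objects, and it deliberately rejects non-CRM types such as `marketing_emails`, which would need a separate endpoint mapping.

```python
from typing import Optional

# CRM object types listed under "Standard Objects" in this proposal.
STANDARD_OBJECTS = {"contacts", "companies", "deals", "tickets", "products", "line_items"}


def object_path(object_type: str, custom_object_name: Optional[str] = None) -> str:
    """Resolve connector options to a CRM v3 objects path."""
    if object_type == "custom_objects":
        if not custom_object_name:
            raise ValueError("custom_object_name is required for custom objects")
        return f"/crm/v3/objects/{custom_object_name}"
    if object_type in STANDARD_OBJECTS:
        return f"/crm/v3/objects/{object_type}"
    raise ValueError(f"not a CRM object type: {object_type!r}")
```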
## Expected Benefits
1. **SMB Market Access**: Enable thousands of HubSpot users to integrate
their data with SeaTunnel
2. **Marketing Analytics**: Unlock marketing ROI analysis, attribution
modeling, and customer journey analytics
3. **Unified Customer View**: Combine CRM, marketing, and transactional data
in a single data warehouse
4. **Competitive Positioning**: Compete with commercial ETL tools like
Fivetran, Stitch, and Airbyte Cloud
5. **Ecosystem Growth**: Attract marketing teams and growth hackers to
SeaTunnel
## Technical Considerations
### Dependencies
- **HTTP Client**: Use Apache HttpClient or OkHttp for REST API calls
- **JSON Processing**: Jackson or Gson for JSON serialization/deserialization
- **OAuth Library**: If supporting OAuth 2.0 authentication
- **Rate Limiting**: Implement token bucket or sliding window algorithm
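
To make the rate-limiting dependency concrete, here is a minimal token bucket sketch. `TokenBucket` and its parameters are illustrative names rather than an existing SeaTunnel class; the clock is injectable so the behavior can be unit tested without sleeping, and the sketch is not thread-safe.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def seconds_until_available(self) -> float:
        """How long a caller should wait before the next token exists."""
        self._refill()
        if self._tokens >= 1.0:
            return 0.0
        return (1.0 - self._tokens) / self.refill_per_second
```

A bucket sized at the full window quota can briefly exceed a strict per-window limit, because the initial burst and the refill both land in the same window. A smaller capacity (for example 10 tokens refilled at 10 per second for a 100-requests-per-10-seconds quota) or a sliding window is the more conservative choice.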
### API Characteristics
- **Rate Limits**:
  - Standard: 100 requests per 10 seconds
  - Professional/Enterprise: Higher limits (150-200 req/10s)
  - Need exponential backoff for 429 responses
- **Pagination**:
  - Cursor-based pagination (`after` parameter)
  - Maximum 100 records per page
  - Need to handle the `paging.next.after` token
- **Incremental Extraction**:
  - Use `lastmodifieddate` or `createdate` properties
  - Filter by date ranges in the search API
  - Store the last successful timestamp in the checkpoint
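
The pagination and checkpoint points above can be sketched as follows. This is an assumption-laden illustration: `iter_records`, `latest_modified`, and the `fetch_page` callable are hypothetical, records are assumed to carry their values under a `properties` key, and timestamps are assumed to be ISO-8601 strings ending in `Z`.

```python
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Page = Dict[str, Any]


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Follow `paging.next.after` until the API stops returning a cursor."""
    after: Optional[str] = None
    while True:
        page = fetch_page(after)
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if after is None:
            break


def parse_ts(value: str) -> datetime:
    # fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_modified(records: Iterable[Dict[str, Any]],
                    field: str = "lastmodifieddate") -> Optional[datetime]:
    """Return the newest timestamp seen, i.e. the next checkpoint value."""
    stamps = [
        parse_ts(r["properties"][field])
        for r in records
        if r.get("properties", {}).get(field)
    ]
    return max(stamps) if stamps else None
```

Passing the page fetcher in as a callable keeps the cursor logic independent of the HTTP client, so it can be exercised with canned pages in unit tests.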
### Error Handling
- **429 Too Many Requests**: Exponential backoff that honors the `Retry-After` header
- **401/403 Authentication**: Fail fast with clear error message
- **400 Bad Request**: Validate property names and filter syntax
- **5xx Server Errors**: Retry with exponential backoff
- **Network Errors**: Configurable retry strategy
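
The status-code policy above could be expressed as one small decision function. This is a sketch only: `retry_delay` is a hypothetical helper, the `Retry-After` value is assumed to be given in seconds, and a real implementation would usually add random jitter to the computed delay.

```python
from typing import Optional


def retry_delay(status: int, attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None to fail fast.

    `attempt` is zero-based: 0 for the first retry, 1 for the second, ...
    """
    backoff = min(cap, base * (2 ** attempt))
    if status == 429:
        # Prefer the server's hint when it is present.
        return float(retry_after) if retry_after is not None else backoff
    if 500 <= status <= 599:
        return backoff
    # 400/401/403 and other client errors will not succeed on retry.
    return None
```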
### Testing
- **HubSpot Developer Account**: Free tier available for testing
- **Test Sandbox**: HubSpot provides sandbox portals for enterprise customers
- **Mock Server**: Create mock API server for unit tests
- **Integration Tests**: Use real HubSpot account with test data
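
For the mock server item, a stand-in for the paged list endpoint can be built from the standard library alone. Everything here is illustrative: the canned `PAGES`, the handler, and `fetch_all_ids` are hypothetical, and the mock ignores the request path and authentication entirely.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List
from urllib.parse import parse_qs, urlparse

# Canned pages keyed by the `after` cursor; "" stands for the first request.
PAGES = {
    "": {"results": [{"id": "1"}, {"id": "2"}], "paging": {"next": {"after": "2"}}},
    "2": {"results": [{"id": "3"}]},
}


class MockHubSpotHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        query = parse_qs(urlparse(self.path).query)
        after = query.get("after", [""])[0]
        page = PAGES.get(after)
        status = 200 if page is not None else 400
        payload = json.dumps(page if page is not None else {"message": "bad cursor"})
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args) -> None:
        pass  # keep test output quiet


def start_mock_server() -> HTTPServer:
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), MockHubSpotHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_all_ids(host: str, port: int) -> List[str]:
    """Walk every page of the mock endpoint and collect record ids."""
    ids: List[str] = []
    after = ""
    while after is not None:
        suffix = f"?after={after}" if after else ""
        conn = HTTPConnection(host, port, timeout=5)
        try:
            conn.request("GET", f"/crm/v3/objects/contacts{suffix}")
            page = json.load(conn.getresponse())
        finally:
            conn.close()
        ids.extend(r["id"] for r in page["results"])
        after = page.get("paging", {}).get("next", {}).get("after")
    return ids
```

A test would call `start_mock_server()`, read the port from `server.server_address[1]`, run the client against `127.0.0.1`, and finish with `server.shutdown()`.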
## Implementation Phases
### Phase 1: Core CRM Objects (MVP)
- Private App authentication
- Contacts, Companies, Deals objects
- Full snapshot and incremental extraction
- Basic property selection and filtering
### Phase 2: Marketing Data
- Email campaigns and statistics
- Forms and submissions
- Landing pages and analytics
- Lists and segmentation
### Phase 3: Advanced Features
- Custom objects support
- Associations/relationships
- OAuth 2.0 authentication
- Advanced filtering and search
### Phase 4: Enterprise Features
- Batch property updates (if needed for sink)
- Webhook-based CDC (using HubSpot webhooks)
- Multi-portal support
- Data quality validation
## References
- [HubSpot API Documentation](https://developers.hubspot.com/docs/api/overview)
- [CRM Objects API](https://developers.hubspot.com/docs/api/crm/understanding-the-crm)
- [Search API](https://developers.hubspot.com/docs/api/crm/search)
- [Associations API](https://developers.hubspot.com/docs/api/crm/associations)
- [Marketing Events API](https://developers.hubspot.com/docs/api/marketing/marketing-events)
- [API Usage Guidelines](https://developers.hubspot.com/docs/api/usage-details)
## Community Impact
This connector will:
- Make SeaTunnel accessible to the SMB market segment
- Enable data-driven marketing and sales analytics
- Provide an open-source alternative to expensive commercial ETL solutions
- Attract marketing operations professionals to the Apache SeaTunnel
community
---
**Priority**: Medium-High
**Estimated Effort**: Medium
**Target Release**: 2.3.15 or 3.0.0