TimothyDing opened a new pull request, #4265: URL: https://github.com/apache/arrow-adbc/pull/4265
## Summary

Add a new ADBC driver for [Hologres](https://www.alibabacloud.com/product/hologres), Alibaba Cloud's real-time data warehouse service built on PostgreSQL. This driver enables high-performance columnar data access to Hologres through the standard ADBC interface.

### Components

**C Driver (`c/driver/hologres/`)**: ~20K lines of new code

- `HologresDatabase`: Connection management with automatic Hologres/PostgreSQL version detection and type resolver initialization
- `HologresConnection`: Full ADBC metadata API (`GetInfo`, `GetObjects`, `GetTableSchema`, `GetTableTypes`, `GetStatistics`)
- `HologresStatement`: Query execution, parameterized queries, and two bulk ingestion paths:
  - **COPY mode** (default): Standard PostgreSQL `COPY FROM STDIN` binary protocol
  - **Stage mode**: Hologres-native stage-based ingestion via Arrow IPC upload with configurable concurrency, batch sizing, and file targeting
- `ArrowCopyReader`: Reads query results via `COPY TO STDOUT` in Arrow IPC format (`arrow` or `arrow_lz4`), bypassing row-by-row binary parsing for significantly better read performance
- `TupleReader`: Reads query results via standard PostgreSQL binary `COPY TO STDOUT` with nanoarrow-based batch assembly
- ON_CONFLICT support: `IGNORE` (skip conflicts) and `UPDATE` (upsert) modes for both COPY and Stage ingestion
- Automatic `application_name` tagging (`adbc_hologres_<version>`) for server-side observability

**Hologres-specific data type support:**

- Standard PostgreSQL types: bool, int2/4/8, float4/8, numeric, text, bytea, date, time, timestamp, timestamptz, interval, uuid
- Array types: int2[], int4[], int8[], float4[], float8[], bool[], text[], bytea[]
- Extended types: JSON, JSONB (with version byte prefix), CHAR(n), VARCHAR(n), roaringbitmap
- Type conversions for Stage mode: timestamptz, large_binary, large_string

**Vendored dependency (`c/vendor/nanoarrow/`)**: nanoarrow IPC

- Vendored nanoarrow IPC reader/writer and flatcc runtime for Arrow IPC serialization/deserialization, used by both the `ArrowCopyReader` (reading Arrow IPC from the COPY protocol) and the `StageWriter` (serializing Arrow batches for Stage upload)

**Python package (`python/adbc_driver_hologres/`)**: ~3.6K lines

- `adbc_driver_hologres`: Python bindings with DBAPI 2.0 support via `adbc_driver_manager`
- Enums: `HologresOnConflict`, `HologresIngestMode`, `StatementOptions`
- Integration tests covering COPY and Stage modes across all supported types
- ASV benchmark suites for read/write performance profiling

**Build system:**

- CMake integration with the `ADBC_DRIVER_HOLOGRES` option
- pkg-config support (`adbc-driver-hologres.pc`)
- Python setuptools with shared library bundling

### Key design decisions

1. **Forked from PostgreSQL driver**: Core PostgreSQL utilities (`postgres_type.h`, `copy/reader.h`, `copy/writer.h`, etc.) are copied into the Hologres driver rather than shared, so that Hologres-specific type handling (JSONB version byte, roaringbitmap, etc.) can evolve independently.
2. **Default COPY read format is `arrow_lz4`**: Hologres supports native Arrow IPC output in its COPY protocol. The `arrow_lz4` format avoids row-by-row binary parsing and uses LZ4 compression, providing better throughput for analytical queries. The standard binary format can still be selected via the `adbc.hologres.copy_format` option.
3. **Stage ingestion for large datasets**: The Stage writer serializes Arrow batches into IPC format, uploads them via dedicated FixedFE connections with configurable concurrency (default: 4 threads), and commits atomically. This path is optimized for bulk loading scenarios where COPY throughput is insufficient.

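
The upload-then-commit flow described for Stage ingestion can be illustrated with a small Python sketch. This is not the driver's C implementation: `stage_ingest`, `upload`, and `commit` are hypothetical names introduced here, and the payloads are assumed to be batches that were already serialized to Arrow IPC. Only the worker count of 4 comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def stage_ingest(
    batches: Iterable[bytes],
    upload: Callable[[int, bytes], str],
    commit: Callable[[List[str]], None],
    max_workers: int = 4,  # mirrors the default concurrency stated in the PR
) -> List[str]:
    """Upload serialized batches concurrently, then commit once.

    `upload` and `commit` are hypothetical callbacks standing in for the
    driver's Stage upload and commit steps; they are not real driver APIs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload, index, payload)
            for index, payload in enumerate(batches)
        ]
        # result() re-raises any upload error, so commit is never reached
        # unless every batch was uploaded successfully.
        staged = [future.result() for future in futures]
    commit(staged)
    return staged
```

Calling `commit` only after every upload has returned is what makes the load all-or-nothing in this sketch; how the driver itself achieves atomicity is not specified in this description.
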
### Testing

- **C unit tests** (~8.6K lines): Comprehensive coverage for all modules: database, connection, statement, COPY reader/writer, Arrow COPY reader, Stage writer, bind stream, error handling, PostgreSQL type resolver, and utility functions
- **Python integration tests** (~2.2K lines): End-to-end tests covering DBAPI 2.0 compliance, COPY/Stage ingestion for all supported types, ON_CONFLICT modes, and edge cases
- **Python benchmarks**: ASV benchmark suites for read (`binary`, `arrow`, `arrow_lz4`) and write (`COPY`, `Stage`) performance at various row counts (1K–10M)

### Configuration options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `adbc.hologres.copy_format` | `binary`, `arrow`, `arrow_lz4` | `arrow_lz4` | COPY TO STDOUT read format |
| `adbc.hologres.ingest_mode` | `copy`, `stage` | `copy` | Bulk ingestion method |
| `adbc.hologres.use_copy` | `true`, `false` | `true` | Enable COPY optimization for ingestion |
| `adbc.hologres.on_conflict` | `none`, `ignore`, `update` | `none` | Conflict resolution for ingestion |
| `adbc.hologres.batch_size_hint_bytes` | integer | `16777216` | Target batch size hint for reads |

## Test plan

- [ ] C unit tests pass: `cd build && ctest --test-dir . -R hologres`
- [ ] Python integration tests pass against a live Hologres instance: `cd python/adbc_driver_hologres && pytest tests/`
- [ ] Build succeeds with `-DADBC_DRIVER_HOLOGRES=ON`
- [ ] Python package installs and connects successfully
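
As a rough illustration of the configuration options listed in this description, the sketch below validates a set of overrides against the documented values and fills in the documented defaults. `build_options` is a hypothetical helper, not part of `adbc_driver_hologres`. The description does not say whether each option is set at the database, connection, or statement level, so the sketch only produces a plain dictionary of string values. Treating the batch size hint as a non-negative integer is an assumption.

```python
from typing import Dict, Mapping, Optional

# Allowed values and defaults, copied from the options table in this PR.
ENUM_OPTIONS = {
    "adbc.hologres.copy_format": ({"binary", "arrow", "arrow_lz4"}, "arrow_lz4"),
    "adbc.hologres.ingest_mode": ({"copy", "stage"}, "copy"),
    "adbc.hologres.use_copy": ({"true", "false"}, "true"),
    "adbc.hologres.on_conflict": ({"none", "ignore", "update"}, "none"),
}
BATCH_SIZE_KEY = "adbc.hologres.batch_size_hint_bytes"
BATCH_SIZE_DEFAULT = "16777216"  # 16 MiB


def build_options(
    overrides: Optional[Mapping[str, object]] = None,
) -> Dict[str, str]:
    """Return the documented defaults merged with validated overrides.

    Hypothetical helper for illustration; values are returned as strings.
    """
    options = {key: default for key, (_, default) in ENUM_OPTIONS.items()}
    options[BATCH_SIZE_KEY] = BATCH_SIZE_DEFAULT
    for key, value in (overrides or {}).items():
        # Render Python booleans as the lowercase strings used in the table.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        if key in ENUM_OPTIONS:
            allowed, _ = ENUM_OPTIONS[key]
            if text not in allowed:
                raise ValueError(f"{key}: {text!r} not in {sorted(allowed)}")
        elif key == BATCH_SIZE_KEY:
            # Assumption: the hint must be a non-negative integer.
            if not text.isdigit():
                raise ValueError(f"{key}: expected an integer, got {text!r}")
        else:
            raise KeyError(f"unknown option: {key}")
        options[key] = text
    return options
```

For example, `build_options({"adbc.hologres.ingest_mode": "stage", "adbc.hologres.on_conflict": "update"})` selects Stage ingestion with upsert semantics and leaves every other option at its documented default.
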
