TimothyDing opened a new pull request, #4265:
URL: https://github.com/apache/arrow-adbc/pull/4265

   ## Summary
   
   Add a new ADBC driver for 
[Hologres](https://www.alibabacloud.com/product/hologres), Alibaba Cloud's 
real-time data warehouse service built on PostgreSQL. This driver enables 
high-performance columnar data access to Hologres through the standard ADBC 
interface.
   
   ### Components
   
   **C Driver (`c/driver/hologres/`)** — ~20K lines of new code
   
   - `HologresDatabase`: Connection management with automatic 
Hologres/PostgreSQL version detection and type resolver initialization
   - `HologresConnection`: Full ADBC metadata API (`GetInfo`, `GetObjects`, 
`GetTableSchema`, `GetTableTypes`, `GetStatistics`)
   - `HologresStatement`: Query execution, parameterized queries, and two bulk 
ingestion paths:
     - **COPY mode** (default): Standard PostgreSQL `COPY FROM STDIN` binary 
protocol
     - **Stage mode**: Hologres-native stage-based ingestion via Arrow IPC 
upload with configurable concurrency, batch sizing, and file targeting
   - `ArrowCopyReader`: Reads query results via `COPY TO STDOUT` in Arrow IPC 
format (`arrow` or `arrow_lz4`), bypassing row-by-row binary parsing for 
significantly better read performance
   - `TupleReader`: Reads query results via standard PostgreSQL binary `COPY TO 
STDOUT` with nanoarrow-based batch assembly
   - ON_CONFLICT support: `IGNORE` (skip conflicts) and `UPDATE` (upsert) modes 
for both COPY and Stage ingestion
   - Automatic `application_name` tagging (`adbc_hologres_<version>`) for 
server-side observability
   
   **Hologres-specific data type support:**
   - Standard PostgreSQL types: bool, int2/4/8, float4/8, numeric, text, bytea, 
date, time, timestamp, timestamptz, interval, uuid
   - Array types: int2[], int4[], int8[], float4[], float8[], bool[], text[], 
bytea[]
   - Extended types: JSON, JSONB (with version byte prefix), CHAR(n), 
VARCHAR(n), roaringbitmap
   - Type conversions for Stage mode: timestamptz, large_binary, large_string
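
The "version byte prefix" noted for JSONB can be illustrated with a short sketch. In PostgreSQL's binary wire format, a JSONB value is sent as a one-byte version number (currently `1`) followed by the JSON text. The sketch assumes Hologres uses the same layout. The function names are hypothetical and are not taken from the driver.

```python
JSONB_WIRE_VERSION = 1  # version byte PostgreSQL writes before the JSON text


def decode_jsonb_binary(payload: bytes) -> str:
    """Strip the leading version byte from a binary-format JSONB value."""
    if not payload:
        raise ValueError("empty JSONB payload")
    version = payload[0]
    if version != JSONB_WIRE_VERSION:
        raise ValueError(f"unsupported JSONB version: {version}")
    # Everything after the version byte is the JSON document as UTF-8 text.
    return payload[1:].decode("utf-8")


def encode_jsonb_binary(text: str) -> bytes:
    """Prepend the version byte, as a COPY writer would before sending JSONB."""
    return bytes([JSONB_WIRE_VERSION]) + text.encode("utf-8")
```

Plain JSON has no such prefix in the binary format, which is one reason a type-aware reader and writer are needed for these columns.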
   
   **Vendored dependency (`c/vendor/nanoarrow/`)** — nanoarrow IPC
   
   - Vendored nanoarrow IPC reader/writer and flatcc runtime for Arrow IPC 
serialization/deserialization, used by both the `ArrowCopyReader` (reading 
Arrow IPC from COPY protocol) and `StageWriter` (serializing Arrow batches for 
Stage upload)
   
   **Python package (`python/adbc_driver_hologres/`)** — ~3.6K lines
   
   - `adbc_driver_hologres`: Python bindings with DBAPI 2.0 support via 
`adbc_driver_manager`
   - Enums: `HologresOnConflict`, `HologresIngestMode`, `StatementOptions`
   - Integration tests covering COPY and Stage modes across all supported types
   - ASV benchmark suites for read/write performance profiling
   
   **Build system:**
   - CMake integration with `ADBC_DRIVER_HOLOGRES` option
   - pkg-config support (`adbc-driver-hologres.pc`)
   - Python setuptools with shared library bundling
   
   ### Key design decisions
   
   1. **Forked from PostgreSQL driver**: Core PostgreSQL utilities 
(`postgres_type.h`, `copy/reader.h`, `copy/writer.h`, etc.) are copied into the 
Hologres driver rather than shared, to allow independent evolution for 
Hologres-specific type handling (JSONB version byte, roaringbitmap, etc.)
   
   2. **Default COPY read format is `arrow_lz4`**: Hologres supports native 
Arrow IPC output in its COPY protocol. The `arrow_lz4` format avoids row-by-row 
binary parsing and uses LZ4 compression, providing better throughput for 
analytical queries. The standard binary format remains available by setting the 
`adbc.hologres.copy_format` option to `binary`.
   
   3. **Stage ingestion for large datasets**: The Stage writer serializes Arrow 
batches into IPC format, uploads them via dedicated FixedFE connections with 
configurable concurrency (default: 4 threads), and commits atomically. This 
path is optimized for bulk loading scenarios where COPY throughput is 
insufficient.
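
The upload-then-commit flow described for Stage ingestion can be sketched roughly as follows. This is a minimal illustration, not the driver's implementation: the `upload` and `commit` callables, the file naming scheme, and the in-memory "stage" are all hypothetical stand-ins, and only the default of 4 workers comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def stage_ingest(
    batches: List[bytes],
    upload: Callable[[str, bytes], None],
    commit: Callable[[List[str]], None],
    max_workers: int = 4,
) -> List[str]:
    """Upload every serialized batch concurrently, then commit once."""
    # Hypothetical naming scheme for the staged files.
    names = [f"part-{i:05d}.arrow" for i in range(len(batches))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload, n, b) for n, b in zip(names, batches)]
        # result() re-raises any upload error, so commit is skipped on failure.
        for future in futures:
            future.result()
    # Reached only when every upload succeeded.
    commit(names)
    return names


# Usage with an in-memory stand-in for the stage and the commit step.
staged: Dict[str, bytes] = {}
committed: List[str] = []
stage_ingest(
    [b"batch-0", b"batch-1", b"batch-2"],
    upload=staged.__setitem__,
    commit=committed.extend,
)
```

Deferring the single commit until all uploads have finished is what keeps a partially uploaded dataset from becoming visible in this sketch.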
   
   ### Testing
   
   - **C unit tests** (~8.6K lines): Comprehensive coverage for all modules — 
database, connection, statement, COPY reader/writer, Arrow COPY reader, Stage 
writer, bind stream, error handling, PostgreSQL type resolver, and utility 
functions
   - **Python integration tests** (~2.2K lines): End-to-end tests covering 
DBAPI 2.0 compliance, COPY/Stage ingestion for all supported types, ON_CONFLICT 
modes, and edge cases
   - **Python benchmarks**: ASV benchmark suites for read (`binary`, `arrow`, 
`arrow_lz4`) and write (`COPY`, `Stage`) performance at various row counts 
(1K–10M)
   
   ### Configuration options
   
   | Option | Values | Default | Description |
   |--------|--------|---------|-------------|
   | `adbc.hologres.copy_format` | `binary`, `arrow`, `arrow_lz4` | `arrow_lz4` | COPY TO STDOUT read format |
   | `adbc.hologres.ingest_mode` | `copy`, `stage` | `copy` | Bulk ingestion method |
   | `adbc.hologres.use_copy` | `true`, `false` | `true` | Enable COPY optimization for ingestion |
   | `adbc.hologres.on_conflict` | `none`, `ignore`, `update` | `none` | Conflict resolution for ingestion |
   | `adbc.hologres.batch_size_hint_bytes` | integer | `16777216` | Target batch size hint for reads |
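
As a rough illustration of how these options combine, the sketch below merges user overrides with the defaults from the table and rejects values the table does not list. It assumes options are passed as string key/value pairs. The helper `resolve_options` is hypothetical and not part of the driver, and the rule that the batch size hint must be a positive integer is also an assumption.

```python
from typing import Dict

# Option keys, allowed values, and defaults as listed in the table above.
_ENUM_OPTIONS = {
    "adbc.hologres.copy_format": ({"binary", "arrow", "arrow_lz4"}, "arrow_lz4"),
    "adbc.hologres.ingest_mode": ({"copy", "stage"}, "copy"),
    "adbc.hologres.use_copy": ({"true", "false"}, "true"),
    "adbc.hologres.on_conflict": ({"none", "ignore", "update"}, "none"),
}
_BATCH_SIZE_KEY = "adbc.hologres.batch_size_hint_bytes"
_BATCH_SIZE_DEFAULT = "16777216"


def resolve_options(overrides: Dict[str, str]) -> Dict[str, str]:
    """Return the full option set: table defaults updated with valid overrides."""
    resolved = {key: default for key, (_, default) in _ENUM_OPTIONS.items()}
    resolved[_BATCH_SIZE_KEY] = _BATCH_SIZE_DEFAULT
    for key, value in overrides.items():
        if key in _ENUM_OPTIONS:
            allowed, _ = _ENUM_OPTIONS[key]
            if value not in allowed:
                raise ValueError(f"{key}: {value!r} not in {sorted(allowed)}")
        elif key == _BATCH_SIZE_KEY:
            # Assumption: the hint must be a positive integer.
            if not value.isdigit() or int(value) <= 0:
                raise ValueError(f"{key}: expected a positive integer")
        else:
            raise KeyError(f"unknown option: {key}")
        resolved[key] = value
    return resolved
```

For example, `resolve_options({"adbc.hologres.ingest_mode": "stage", "adbc.hologres.on_conflict": "update"})` would describe an upsert through the Stage path while leaving the read format at `arrow_lz4`.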
   
   ## Test plan
   
   - [ ] C unit tests pass: `cd build && ctest --test-dir . -R hologres`
   - [ ] Python integration tests pass against a live Hologres instance: `cd 
python/adbc_driver_hologres && pytest tests/`
   - [ ] Build succeeds with `-DADBC_DRIVER_HOLOGRES=ON`
   - [ ] Python package installs and connects successfully

