suxiaogang223 opened a new issue, #60505:
URL: https://github.com/apache/doris/issues/60505

   ## Background
   
   Currently, Doris supports querying Paimon system tables (e.g., 
`table$snapshots`, `table$partitions`) through the Table-Valued Function (TVF) 
path. The execution flow is:
   
   ```
   SQL: SELECT * FROM catalog.db.table$snapshots
     → MetadataScanNode → TVF → JNI Scanner → Paimon SDK
   ```
   
   This approach works well for **metadata-oriented system tables** (snapshots, 
manifests, partitions, etc.) that return small result sets from metadata files.
   
   However, for **data-oriented system tables** like `binlog`, `audit_log`, and 
`ro` (read-optimized), the current JNI-based approach has significant 
performance limitations:
   
   1. **Data Source Difference**: Unlike metadata tables, 
`binlog`/`audit_log`/`ro` read actual data files (ORC/Parquet), not metadata 
files
   2. **JNI Overhead**: Large data volumes suffer from JNI 
serialization/deserialization overhead
   3. **Missing Native Optimizations**: The JNI path cannot use Doris's native 
vectorized ORC/Parquet readers or their predicate pushdown optimizations
   
   ## Paimon System Table Classification
   
   | Category | System Tables | Data Source | Current Path | Proposed Path |
   |----------|--------------|-------------|--------------|---------------|
   | Metadata | snapshots, manifests, partitions, schemas, options, tags, branches, files, buckets, etc. | Metadata files | TVF + JNI | Keep as-is |
   | Data | **binlog**, **audit_log**, **ro** | Actual data files (ORC/Parquet) | TVF + JNI | **Native Read** |
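
The classification in the table can be sketched as a small lookup on the `$` suffix of the table name. This is only an illustration of the proposed routing rule: `SysTableClassifier`, `ReadPath`, and their methods are hypothetical names, not existing Doris APIs, and the sketch treats every suffix other than the three data-oriented tables as metadata.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical read paths, named after the "Current Path" / "Proposed Path" columns.
enum ReadPath { TVF_JNI, NATIVE }

// Hypothetical helper, not part of Doris: maps a system table name to a read path.
final class SysTableClassifier {
    // Data-oriented system tables listed in the table above.
    private static final Set<String> DATA_TABLES =
            new HashSet<>(Arrays.asList("binlog", "audit_log", "ro"));

    private SysTableClassifier() {}

    // Returns the suffix after the last '$', or null for a regular table name.
    static String sysTableSuffix(String tableName) {
        int idx = tableName.lastIndexOf('$');
        if (idx < 0 || idx == tableName.length() - 1) {
            return null;
        }
        return tableName.substring(idx + 1);
    }

    // Proposed rule: data-oriented tables go native, all other system tables keep TVF + JNI.
    static ReadPath proposedPath(String tableName) {
        String suffix = sysTableSuffix(tableName);
        if (suffix == null) {
            throw new IllegalArgumentException("not a system table: " + tableName);
        }
        return DATA_TABLES.contains(suffix.toLowerCase(Locale.ROOT))
                ? ReadPath.NATIVE
                : ReadPath.TVF_JNI;
    }
}
```

For example, `SysTableClassifier.proposedPath("orders$binlog")` yields `NATIVE`, while `orders$snapshots` stays on `TVF_JNI`.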
   
   ## Paimon binlog/audit_log Implementation Analysis
   
   In Paimon's codebase, `BinlogTable` and `AuditLogTable` are special:
   
   - They implement the `DataTable` interface (not just `ReadonlyTable`)
   - They **wrap the underlying `FileStoreTable`** and read actual data files
   - They reuse `DataSplit` from the source table
   - The underlying storage format is ORC/Parquet
   
   Key implementation files in Paimon:
   - `paimon-core/.../table/system/BinlogTable.java` - extends AuditLogTable
   - `paimon-core/.../table/system/AuditLogTable.java` - wraps FileStoreTable
   
   This makes them ideal candidates for native reading in Doris.
   
   ## Proposal
   
   Refactor the system table query path to support native reading for 
data-oriented Paimon system tables.
   
   ### Phase 1: FE Refactoring
   
   **Goal**: Route `binlog`/`audit_log`/`ro` queries through `PaimonScanNode` 
instead of `MetadataScanNode`.
   
   1. **Extend `SysTable` interface**
      - Add `useNativeTablePath()` method to distinguish execution paths
      - Add `getSchema()` method for native path schema retrieval
   
   2. **Create `PaimonSysExternalTable`**
      - New class extending `PaimonExternalTable`
      - Wraps source table with system table type
      - Returns Paimon `BinlogTable`/`AuditLogTable` instance from 
`getPaimonTable()`
   
   3. **Modify `BindRelation`**
      - Check `useNativeTablePath()` before creating TVF relation
      - Create `LogicalFileScan` for native-path system tables
   
   4. **Adapt `PaimonScanNode`**
      - Support `PaimonSysExternalTable` as scan source
      - Generate splits from system table's `ReadBuilder`
      - Pass system table type to BE via `TPaimonFileDesc`
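
The routing described in steps 1 and 3 might look roughly like the sketch below. All types here are simplified, hypothetical stand-ins (`SysTableSketch`, `BindRelationSketch`, `RelationKind`), not Doris source code; only the method name `useNativeTablePath()` comes from the proposal. Making the method default to `false` is an assumption, chosen so that existing metadata tables would keep the TVF path without changes.

```java
// Hypothetical, simplified stand-in for the FE `SysTable` interface.
interface SysTableSketch {
    String sysTableName();

    // Proposed hook: defaults to false so metadata tables keep the TVF path (assumption).
    default boolean useNativeTablePath() {
        return false;
    }
}

// Metadata-oriented system table (e.g. snapshots): inherits the default.
final class MetadataSysTableSketch implements SysTableSketch {
    private final String name;

    MetadataSysTableSketch(String name) {
        this.name = name;
    }

    @Override
    public String sysTableName() {
        return name;
    }
}

// Data-oriented system table (binlog, audit_log, ro): opts into the native path.
final class DataSysTableSketch implements SysTableSketch {
    private final String name;

    DataSysTableSketch(String name) {
        this.name = name;
    }

    @Override
    public String sysTableName() {
        return name;
    }

    @Override
    public boolean useNativeTablePath() {
        return true;
    }
}

// The two relation kinds the binder would choose between.
enum RelationKind { TVF_RELATION, LOGICAL_FILE_SCAN }

// Stand-in for the check that `BindRelation` would perform before creating a relation.
final class BindRelationSketch {
    private BindRelationSketch() {}

    static RelationKind bindSysTable(SysTableSketch sysTable) {
        return sysTable.useNativeTablePath()
                ? RelationKind.LOGICAL_FILE_SCAN
                : RelationKind.TVF_RELATION;
    }
}
```

With this shape, only the data-oriented tables need to override the hook, and the binder needs a single branch.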
   
   ### Phase 2: BE Native Reader Implementation
   
   **Goal**: Implement native readers for binlog/audit_log with row 
transformation logic.
   
   1. **Extend Thrift definitions**
      - Add `sys_table_type` field to `TPaimonFileDesc`
      - Add `force_keep_delete` and `is_streaming` flags
   
   2. **Implement `PaimonAuditLogReader`**
      - Wrap native ORC/Parquet reader
      - Add `rowkind` column based on delete vectors
      - Support `forceKeepDelete` semantics
   
   3. **Implement `PaimonBinlogReader`**
      - Extend `PaimonAuditLogReader`
      - Convert columns to array types
      - Pack UPDATE_BEFORE/UPDATE_AFTER pairs (streaming mode)
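
The row transformation in step 3 can be illustrated with a small sketch. It is written in Java only to show the intended semantics; the actual BE readers would be C++ and operate on columnar blocks. `ChangeRow`, `BinlogRow`, and `BinlogPackingSketch` are hypothetical names, and the sketch assumes that an UPDATE_BEFORE (`-U`) row is immediately followed by its UPDATE_AFTER (`+U`) row, that both share one schema, and that a packed pair is reported with row kind `+U`.

```java
import java.util.ArrayList;
import java.util.List;

// One changelog row as an audit_log-style reader would emit it: row kind plus field values.
final class ChangeRow {
    final String rowKind; // "+I", "-U", "+U" or "-D"
    final Object[] fields;

    ChangeRow(String rowKind, Object... fields) {
        this.rowKind = rowKind;
        this.fields = fields;
    }
}

// One binlog row: every column becomes an array holding one or two values.
final class BinlogRow {
    final String rowKind;
    final List<List<Object>> columns;

    BinlogRow(String rowKind, List<List<Object>> columns) {
        this.rowKind = rowKind;
        this.columns = columns;
    }
}

final class BinlogPackingSketch {
    private BinlogPackingSketch() {}

    // Packs adjacent -U/+U pairs into one row; other rows become single-element arrays.
    static List<BinlogRow> pack(List<ChangeRow> rows) {
        List<BinlogRow> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            ChangeRow cur = rows.get(i);
            boolean paired = "-U".equals(cur.rowKind)
                    && i + 1 < rows.size()
                    && "+U".equals(rows.get(i + 1).rowKind);
            if (paired) {
                ChangeRow after = rows.get(++i); // consume the matching +U row
                out.add(new BinlogRow("+U", toArrays(cur.fields, after.fields)));
            } else {
                out.add(new BinlogRow(cur.rowKind, toArrays(cur.fields, null)));
            }
        }
        return out;
    }

    // Builds one array per column: [before, after] for a pair, [value] otherwise.
    private static List<List<Object>> toArrays(Object[] first, Object[] second) {
        List<List<Object>> columns = new ArrayList<>();
        for (int c = 0; c < first.length; c++) {
            List<Object> values = new ArrayList<>();
            values.add(first[c]);
            if (second != null) {
                values.add(second[c]);
            }
            columns.add(values);
        }
        return columns;
    }
}
```

Given the rows `+I(1, a)`, `-U(1, a)`, `+U(1, b)`, `-D(1, b)`, the sketch produces three binlog rows, and the middle one carries `[a, b]` in its second column.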
   
   ## Expected Architecture
   
   ```
   SQL: SELECT * FROM catalog.db.table$binlog
                       ↓
       BindRelation (useNativeTablePath=true)
                       ↓
       PaimonSysExternalTable (wraps source table)
                       ↓
       PaimonScanNode (reuse existing logic)
         - Serialize BinlogTable
         - Get DataSplits from BinlogTable
         - Set sys_table_type="binlog"
                       ↓
       BE: PaimonBinlogReader
         - Native ORC/Parquet reading
         - Add rowkind column
         - Array conversion for binlog
         - Changelog packing (streaming)
   ```
   
   ## Benefits
   
   1. **Performance**: Leverage native vectorized readers, avoid JNI overhead
   2. **Predicate Pushdown**: Filters can be pushed down into the native ORC/Parquet readers
   3. **Resource Efficiency**: Reduced memory copying between Java and C++
   4. **Consistency**: Unified execution path with regular Paimon tables
   5. **Scalability**: Better performance for large-scale CDC scenarios
   
   ## Tasks
   
   - [ ] Phase 1.1: Extend `SysTable` interface with `useNativeTablePath()`
   - [ ] Phase 1.2: Implement `PaimonSysExternalTable` class
   - [ ] Phase 1.3: Modify `BindRelation` to support native path routing
   - [ ] Phase 1.4: Adapt `PaimonScanNode` for system tables
   - [ ] Phase 2.1: Extend Thrift definitions for system table params
   - [ ] Phase 2.2: Implement `PaimonAuditLogReader`
   - [ ] Phase 2.3: Implement `PaimonBinlogReader`
   - [ ] Phase 2.4: Add regression tests
   
   ## Related
   
   - Paimon BinlogTable: https://github.com/apache/paimon/blob/master/paimon-core/src/main/java/org/apache/paimon/table/system/BinlogTable.java
   - Paimon AuditLogTable: https://github.com/apache/paimon/blob/master/paimon-core/src/main/java/org/apache/paimon/table/system/AuditLogTable.java
   
   ## Use Case
   
   This feature is particularly valuable for:
   - Real-time CDC pipelines reading Paimon binlog
   - Data auditing scenarios with large audit_log tables
   - Read-optimized queries on Paimon tables via the `ro` system table


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

