[PR] feat: introduce table scan, split and plan [paimon-cpp]

via GitHub Wed, 17 Jun 2026 20:14:11 -0700


lszskye opened a new pull request, #99:
URL: https://github.com/apache/paimon-cpp/pull/99


   ## Introduce Table Source Module: Scan, Split and Plan
   
   ### Summary
   
   Add the `table/source` module, which provides the public API and core 
implementation for table scan, split generation, and plan creation. This module 
serves as the entry point for batch and streaming read operations, supporting 
split serialization/deserialization compatible with Java Paimon for 
cross-language interoperability.
   
   ### New Classes
   
   #### Public API (`include/paimon/table/source/`)
   
   **`Split`** — Base class for input splits used by batch computation engines. 
Supports binary serialization and deserialization compatible with the Java 
version, enabling cross-process and cross-language split transmission. A 
`Split` can be either a `DataSplit` (for direct data file reads) or an 
`IndexedSplit` (for reads leveraging global indexes).
   
   **`DataSplit`** — Extends `Split` for direct data file reading scenarios. 
Contains `SimpleDataFileMeta` describing each file's path, size, row count, 
sequence numbers, schema, level, creation time, and optional deleted-row count. 
Provides a file list accessor for append table reads.
   
   **`Plan`** — Interface representing the result of a `TableScan`. Exposes the 
generated splits and the associated snapshot ID (or `nullopt` for empty tables).
   
   **`TableScan`** — Scanner interface that reads table metadata and produces a 
`Plan`. Created from a `ScanContext`, it serves as the primary entry point for 
initiating table scan operations in both batch and streaming modes.
   
   **`TableRead`** — Given a `Split` or a list of `Split`s, creates 
`BatchReader` instances for reading data. Manages memory allocation through a 
shared `MemoryPool`. 
   
   **`StartupMode`** — Specifies the startup mode. Supports `Default`, 
`LatestFull`, `Latest`, `FromSnapshot`, `FromSnapshotFull`, and `FromTimestamp` 
modes, each with different semantics for batch vs. streaming sources. Provides 
string conversion and parsing.
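
   As a rough illustration of the string conversion and parsing mentioned above, here is a minimal sketch. The enum below is a stand-in, not the PR's actual declaration, and the string forms are an assumption modeled on the `scan.mode` option values used by Java Paimon; the function names `ToString` and `FromString` are hypothetical.

```cpp
#include <initializer_list>
#include <optional>
#include <string_view>

// Stand-in for the StartupMode enum described above (not the PR's declaration).
enum class StartupMode {
    Default,
    LatestFull,
    Latest,
    FromSnapshot,
    FromSnapshotFull,
    FromTimestamp,
};

// Assumed string forms, modeled on Java Paimon's `scan.mode` values.
inline std::string_view ToString(StartupMode mode) {
    switch (mode) {
        case StartupMode::Default:          return "default";
        case StartupMode::LatestFull:       return "latest-full";
        case StartupMode::Latest:           return "latest";
        case StartupMode::FromSnapshot:     return "from-snapshot";
        case StartupMode::FromSnapshotFull: return "from-snapshot-full";
        case StartupMode::FromTimestamp:    return "from-timestamp";
    }
    return "unknown";
}

// Parses a string back into a mode; returns nullopt for unrecognized input.
inline std::optional<StartupMode> FromString(std::string_view text) {
    for (StartupMode mode :
         {StartupMode::Default, StartupMode::LatestFull, StartupMode::Latest,
          StartupMode::FromSnapshot, StartupMode::FromSnapshotFull,
          StartupMode::FromTimestamp}) {
        if (ToString(mode) == text) {
            return mode;
        }
    }
    return std::nullopt;
}
```

   Returning `std::optional` keeps the parse failure explicit; the real API may well report errors differently (for example through a status type).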
   
   #### Internal Implementation (`src/paimon/core/table/source/`)
   
   **`AbstractTableScan`** — Abstract base class above `FileStoreScan` that 
provides input split generation logic. Implements the `CreateStartingScanner` 
method which routes to different `StartingScanner` implementations based on the 
configured `StartupMode`, handling snapshot lookup, timestamp-based resolution, 
and tag-based scanning.
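
   The routing described above could look roughly like the sketch below. Everything here is an assumption: the scanner names are borrowed from Java Paimon, `RouteStartingScanner` is a hypothetical helper, and treating `Default` like `LatestFull` only holds when no snapshot or timestamp is configured. The C++ implementation may route differently.

```cpp
#include <string_view>

// Stand-in for the StartupMode enum (not the PR's declaration).
enum class StartupMode {
    Default,
    LatestFull,
    Latest,
    FromSnapshot,
    FromSnapshotFull,
    FromTimestamp,
};

// Hypothetical helper: returns the name of the scanner a mode might route to.
// Scanner names are taken from Java Paimon and are assumptions for paimon-cpp.
inline std::string_view RouteStartingScanner(StartupMode mode, bool streaming) {
    switch (mode) {
        case StartupMode::Default:  // assumed: no snapshot/timestamp configured
        case StartupMode::LatestFull:
            return "FullStartingScanner";
        case StartupMode::Latest:
            return streaming ? "ContinuousLatestStartingScanner"
                             : "FullStartingScanner";
        case StartupMode::FromSnapshot:
            return streaming ? "ContinuousFromSnapshotStartingScanner"
                             : "StaticFromSnapshotStartingScanner";
        case StartupMode::FromSnapshotFull:
            return streaming ? "ContinuousFromSnapshotFullStartingScanner"
                             : "StaticFromSnapshotStartingScanner";
        case StartupMode::FromTimestamp:
            return streaming ? "ContinuousFromTimestampStartingScanner"
                             : "StaticFromTimestampStartingScanner";
    }
    return "unknown";
}
```

   The point of the sketch is only that the same mode can resolve to a different scanner depending on whether the source is batch or streaming.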
   
   **`DataSplitImpl`** — Concrete implementation of `DataSplit` with full 
serialization support. Tracks partition, bucket, file metadata, deletion files, 
streaming flag, and raw-convertibility. Provides a `Builder` for 
construction and supports multiple `DataFileMeta` serializer versions (v9, v10, 
v12, and legacy).
   
   **`SplitGenerator`** — Generates split groups from `DataFileMeta` lists. 
Produces `SplitGroup`s that distinguish between raw-convertible groups 
(directly readable without merge) and non-raw-convertible groups (requiring 
merge-tree processing). Provides separate entry points for batch and streaming 
split generation.
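
   To make the grouping idea concrete, here is a minimal sketch of one way split groups can be formed for an append-only table: ordered bin packing by file size. This is not the PR's code. `FileEntry`, `PackForOrdered`, `target_size`, and `open_file_cost` are hypothetical names, and marking every group raw-convertible assumes append-only files that need no merge.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for DataFileMeta.
struct FileEntry {
    std::string name;
    int64_t size = 0;
};

// Simplified split group; append-only files are assumed raw-convertible.
struct SplitGroup {
    std::vector<FileEntry> files;
    bool raw_convertible = true;
};

// Packs files, in their given order, into groups whose total weight stays
// at or below target_size where possible. Each file weighs at least
// open_file_cost, so many tiny files do not pile into one group.
// A single file heavier than target_size gets a group of its own.
inline std::vector<SplitGroup> PackForOrdered(const std::vector<FileEntry>& files,
                                              int64_t target_size,
                                              int64_t open_file_cost) {
    std::vector<SplitGroup> groups;
    SplitGroup current;
    int64_t current_weight = 0;
    for (const FileEntry& file : files) {
        const int64_t weight = std::max(file.size, open_file_cost);
        // Close the current group if adding this file would overflow it.
        if (!current.files.empty() && current_weight + weight > target_size) {
            groups.push_back(std::move(current));
            current = SplitGroup{};
            current_weight = 0;
        }
        current.files.push_back(file);
        current_weight += weight;
    }
    if (!current.files.empty()) {
        groups.push_back(std::move(current));
    }
    return groups;
}
```

   Groups that require merge-tree processing would need additional logic (for example grouping files with overlapping key ranges), which this sketch deliberately leaves out.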
   
   **`DeletionFile`** — Represents a deletion vector index file associated with 
a data file. Contains path, offset, length, and optional cardinality (number of 
deleted rows). Supports versioned serialization/deserialization (v3 and v4+) 
with list-level serialize/deserialize helpers.
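
   The effect of versioned serialization on the optional cardinality can be sketched as follows. The byte layout here is purely illustrative and is not claimed to be wire-compatible with Java Paimon; `Serialize`, `Deserialize`, the length-prefixed path, and the `-1` sentinel for a missing cardinality are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified stand-in for the DeletionFile described above.
struct DeletionFile {
    std::string path;
    int64_t offset = 0;
    int64_t length = 0;
    std::optional<int64_t> cardinality;  // number of deleted rows, if known
};

using Bytes = std::vector<uint8_t>;

// Appends a 64-bit value in big-endian byte order.
inline void PutInt64(Bytes& out, int64_t value) {
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(static_cast<uint64_t>(value) >> shift));
    }
}

// Reads a big-endian 64-bit value; Bytes::at throws on truncated input.
inline int64_t GetInt64(const Bytes& in, size_t& pos) {
    uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        value = (value << 8) | in.at(pos++);
    }
    return static_cast<int64_t>(value);
}

// Illustrative layout: path length, path bytes, offset, length, and
// (version 4 and later only) cardinality, with -1 meaning "absent".
inline Bytes Serialize(const DeletionFile& file, int version) {
    Bytes out;
    PutInt64(out, static_cast<int64_t>(file.path.size()));
    out.insert(out.end(), file.path.begin(), file.path.end());
    PutInt64(out, file.offset);
    PutInt64(out, file.length);
    if (version >= 4) {
        PutInt64(out, file.cardinality.value_or(-1));
    }
    return out;
}

inline DeletionFile Deserialize(const Bytes& in, int version) {
    size_t pos = 0;
    DeletionFile file;
    const int64_t path_size = GetInt64(in, pos);
    if (path_size < 0 || static_cast<size_t>(path_size) > in.size() - pos) {
        throw std::out_of_range("truncated deletion file path");
    }
    const auto begin = in.begin() + static_cast<std::ptrdiff_t>(pos);
    file.path.assign(begin, begin + static_cast<std::ptrdiff_t>(path_size));
    pos += static_cast<size_t>(path_size);
    file.offset = GetInt64(in, pos);
    file.length = GetInt64(in, pos);
    if (version >= 4) {
        const int64_t cardinality = GetInt64(in, pos);
        if (cardinality >= 0) {
            file.cardinality = cardinality;
        }
    }
    return file;
}
```

   With this layout, a v3 round trip loses the cardinality while a v4 round trip preserves it, which is the practical difference a reader of older splits has to tolerate.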
   
   **`ScanMode`** — Enum specifying which part of a snapshot to scan: `ALL` 
(complete data files) or `DELTA` (only newly changed files).
   
   **`PlanImpl`** — Concrete implementation of the `Plan` interface. Holds a 
snapshot ID and a vector of splits, and provides a static `EmptyPlan()` factory 
method for empty results.
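
   A minimal sketch of how the `Plan` interface and `PlanImpl` might fit together is shown below. The method names `SnapshotId()` and `Splits()`, the use of `std::shared_ptr`, and the empty `Split` base are assumptions for illustration; only `EmptyPlan()` and the snapshot-ID-or-`nullopt` behavior come from the description above.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Placeholder for the real Split base class.
struct Split {
    virtual ~Split() = default;
};

// Hypothetical shape of the Plan interface.
class Plan {
 public:
    virtual ~Plan() = default;
    // Snapshot the plan was built from; nullopt for an empty table.
    virtual std::optional<int64_t> SnapshotId() const = 0;
    virtual const std::vector<std::shared_ptr<Split>>& Splits() const = 0;
};

class PlanImpl : public Plan {
 public:
    PlanImpl(std::optional<int64_t> snapshot_id,
             std::vector<std::shared_ptr<Split>> splits)
        : snapshot_id_(snapshot_id), splits_(std::move(splits)) {}

    // Factory for a plan with no snapshot and no splits.
    static std::shared_ptr<Plan> EmptyPlan() {
        return std::make_shared<PlanImpl>(std::nullopt,
                                          std::vector<std::shared_ptr<Split>>{});
    }

    std::optional<int64_t> SnapshotId() const override { return snapshot_id_; }
    const std::vector<std::shared_ptr<Split>>& Splits() const override {
        return splits_;
    }

 private:
    std::optional<int64_t> snapshot_id_;
    std::vector<std::shared_ptr<Split>> splits_;
};
```

   A caller would typically check `SnapshotId()` before reading: an empty optional signals that there is nothing to read yet rather than an error.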
   
   ### New Tests
   
   - `deletion_file_test.cpp`
   - `split_generator_test.cpp`
   - `startup_mode_test.cpp`
   - `table_scan_test.cpp`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
