lszskye opened a new pull request, #99: URL: https://github.com/apache/paimon-cpp/pull/99
## Introduce Table Source Module: Scan, Split and Plan

### Summary

Add the `table/source` module, which provides the public API and core implementation for table scan, split generation, and plan creation. This module serves as the entry point for batch and streaming read operations, supporting split serialization/deserialization compatible with Java Paimon for cross-language interoperability.

### New Classes

#### Public API (`include/paimon/table/source/`)

**`Split`**: Base class for input splits used by batch computation engines. Supports binary serialization and deserialization compatible with the Java version, enabling cross-process and cross-language split transmission. A `Split` can be either a `DataSplit` (for direct data file reads) or an `IndexedSplit` (for reads leveraging global indexes).

**`DataSplit`**: Extends `Split` for direct data file reading scenarios. Contains `SimpleDataFileMeta` describing each file's path, size, row count, sequence numbers, schema, level, creation time, and optional delete row count. Provides a file list accessor for append table reads.

**`Plan`**: Interface representing the result of a `TableScan`. Exposes the generated splits and the associated snapshot ID (or `nullopt` for empty tables).

**`TableScan`**: Scanner interface that reads table metadata and produces a `Plan`. Created from a `ScanContext`, it serves as the primary entry point for initiating table scan operations in both batch and streaming modes.

**`TableRead`**: Given a `Split` or a list of `Split`s, creates `BatchReader` instances for reading data. Manages memory allocation through a shared `MemoryPool`.

**`StartupMode`**: Specifies the startup mode. Supports `Default`, `LatestFull`, `Latest`, `FromSnapshot`, `FromSnapshotFull`, and `FromTimestamp` modes, each with different semantics for batch vs. streaming sources. Provides string conversion and parsing.
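The string conversion and parsing mentioned for `StartupMode` could look roughly like the sketch below. It is illustrative only: the `sketch` namespace, the `k`-prefixed enumerators, and the `ToString` / `FromString` helpers are hypothetical names, and the kebab-case spellings are assumed to mirror the Java Paimon `scan.mode` option values. The actual class in this PR may use different names and signatures.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

namespace sketch {

// Hypothetical stand-in for the StartupMode described above; not the real
// paimon-cpp declaration.
enum class StartupMode {
    kDefault,
    kLatestFull,
    kLatest,
    kFromSnapshot,
    kFromSnapshotFull,
    kFromTimestamp
};

// Convert a mode to its textual form. The spellings are an assumption
// (borrowed from the Java option values), not taken from this PR.
inline std::string_view ToString(StartupMode mode) {
    switch (mode) {
        case StartupMode::kDefault:          return "default";
        case StartupMode::kLatestFull:       return "latest-full";
        case StartupMode::kLatest:           return "latest";
        case StartupMode::kFromSnapshot:     return "from-snapshot";
        case StartupMode::kFromSnapshotFull: return "from-snapshot-full";
        case StartupMode::kFromTimestamp:    return "from-timestamp";
    }
    return "unknown";
}

// Parse a textual mode; returns std::nullopt for unrecognized input so the
// caller decides how to report the error.
inline std::optional<StartupMode> FromString(std::string_view text) {
    constexpr StartupMode kAll[] = {
        StartupMode::kDefault,      StartupMode::kLatestFull,
        StartupMode::kLatest,       StartupMode::kFromSnapshot,
        StartupMode::kFromSnapshotFull, StartupMode::kFromTimestamp};
    for (StartupMode mode : kAll) {
        if (ToString(mode) == text) {
            return mode;
        }
    }
    return std::nullopt;
}

}  // namespace sketch
```

Driving the parser from `ToString` keeps the two directions in sync: adding an enumerator only requires one new `case` plus one entry in `kAll`.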
#### Internal Implementation (`src/paimon/core/table/source/`)

**`AbstractTableScan`**: Abstract base class above `FileStoreScan` that provides input split generation logic. Implements the `CreateStartingScanner` method, which routes to different `StartingScanner` implementations based on the configured `StartupMode`, handling snapshot lookup, timestamp-based resolution, and tag-based scanning.

**`DataSplitImpl`**: Concrete implementation of `DataSplit` with full serialization support. Tracks partition, bucket, file metadata, deletion files, streaming flag, and raw-convertibility. Includes a `Builder` pattern for construction and supports multiple `DataFileMeta` serializer versions (v9, v10, v12, legacy).

**`SplitGenerator`**: Generates split groups from `DataFileMeta` lists. Produces `SplitGroup`s that distinguish between raw-convertible groups (directly readable without merge) and non-raw-convertible groups (requiring merge-tree processing). Provides separate entry points for batch and streaming split generation.

**`DeletionFile`**: Represents a deletion vector index file associated with a data file. Contains path, offset, length, and optional cardinality (number of deleted rows). Supports versioned serialization/deserialization (v3 and v4+) with list-level serialize/deserialize helpers.

**`ScanMode`**: Enum specifying which part of a snapshot to scan: `ALL` (complete data files) or `DELTA` (only newly changed files).

**`PlanImpl`**: Concrete implementation of the `Plan` interface. Holds a snapshot ID and a vector of splits, and provides a static `EmptyPlan()` factory method for empty results.

### New Tests

- `deletion_file_test.cpp`
- `split_generator_test.cpp`
- `startup_mode_test.cpp`
- `table_scan_test.cpp`
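For readers unfamiliar with the `Plan` / `PlanImpl` split described under New Classes, here is a minimal sketch of how an interface exposing splits plus an optional snapshot ID might pair with a concrete class and an `EmptyPlan()` factory. Everything here is an assumption for illustration: the `sketch` namespace, the placeholder `Split` struct, the accessor names `SnapshotId()` and `Splits()`, and the use of `std::shared_ptr` are not taken from the PR's headers.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

namespace sketch {

// Placeholder for the real Split base class (hypothetical, no payload).
struct Split {
    virtual ~Split() = default;
};

// Hypothetical shape of the Plan interface: generated splits plus the
// snapshot they came from, or std::nullopt for an empty table.
class Plan {
public:
    virtual ~Plan() = default;
    virtual std::optional<std::int64_t> SnapshotId() const = 0;
    virtual const std::vector<std::shared_ptr<Split>>& Splits() const = 0;
};

// Concrete holder, loosely modelled on the PlanImpl description.
class PlanImpl : public Plan {
public:
    PlanImpl(std::optional<std::int64_t> snapshot_id,
             std::vector<std::shared_ptr<Split>> splits)
        : snapshot_id_(snapshot_id), splits_(std::move(splits)) {}

    // Factory for "nothing to read": no snapshot ID and no splits.
    static std::shared_ptr<Plan> EmptyPlan() {
        return std::make_shared<PlanImpl>(
            std::nullopt, std::vector<std::shared_ptr<Split>>{});
    }

    std::optional<std::int64_t> SnapshotId() const override { return snapshot_id_; }
    const std::vector<std::shared_ptr<Split>>& Splits() const override { return splits_; }

private:
    std::optional<std::int64_t> snapshot_id_;
    std::vector<std::shared_ptr<Split>> splits_;
};

}  // namespace sketch
```

Returning the interface type from `EmptyPlan()` lets callers treat an empty table the same way as a populated one, checking `SnapshotId()` only when they need to know which snapshot was scanned.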
