GitHub user A0R0P0I7T edited a discussion: [GSoC 2026 Aspiring Contributor] 
Introducing Myself – Flink Connector for Apache IoTDB 2.X Table Mode

Hi Apache IoTDB Community,

My name is Arpit Saha, and I am an undergraduate student in Information and 
Communication Technology applying for Google Summer of Code 2026. I am writing to 
introduce myself, share the technical research I have conducted so far, and 
present my current progress toward the Flink Connector for Apache IoTDB 2.X 
Table Mode project. I have been in contact with mentor Mr. Haonan Hou, who has 
been kind enough to guide me toward the relevant resources and codebase. While 
I have not yet made direct contributions to the IoTDB repository, I have been 
investing significant effort into understanding the codebase, the existing 
connector limitations, and the architectural requirements of this project.

---

**Background**

I have prior open source experience with Apache Gravitino, which gave me 
familiarity with Java-based codebases and the Apache contribution workflow. 
This has allowed me to navigate the IoTDB codebase comfortably from the start.

---

**Codebase Analysis**

I studied the existing Flink-IoTDB connectors in `iotdb-extras` along with the 
`flink-tsfile-connector` and identified the following gaps:

- The `flink-iotdb-connector` (tree mode) uses the deprecated `SourceFunction` 
API with a hardcoded SQL string, no split-enumerator architecture, no TAG/FIELD 
awareness, and no fault tolerance — making it insufficient for table mode.
- The `flink-sql-iotdb-connector` delegates execution to Flink's SQL planner 
via a factory/provider pattern; studying it deepened my understanding of how 
Flink's SQL/Table API layer integrates with IoTDB as a registered table source.
- I also went through the `flink-tsfile-connector` to understand how a more 
complete connector is structured — how the base classes, test infrastructure, 
execution environments, and input formats are organized. This gave me a much 
clearer picture of what a well-structured connector looks like before I began 
designing my own.

The existing `flink-iotdb-connector` was built entirely around tree-mode path 
semantics and the deprecated `SourceFunction` API — it has no awareness of 
IoTDB 2.X's table model, TAG/FIELD structure, or modern streaming source 
architecture, which is precisely the gap this project addresses.

---

**Understanding FLIP-27 and Why It Matters**

Going through the FLIP-27 documentation was the most valuable part of my 
research. The core insight is the clean separation between the 
`SplitEnumerator` — which dynamically generates time-range splits as new IoTDB 
data arrives — and the `SourceReader` — which independently executes bounded 
queries per split and emits records downstream.
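To check my own understanding of this division of labor, I sketched the enumerator side in plain Java. This is not the real Flink API — `TimeRangeSplit` and `discoverSplits` are simplified stand-ins I made up — but it captures the core idea: the enumerator's only job is to slice the growing time axis into bounded splits, which readers can then query independently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual FLIP-27 interfaces: the enumerator
// watches the newest timestamp in IoTDB and emits any newly completed
// fixed-size [start, end) windows as splits for the readers.
public class TimeRangeEnumeratorSketch {

    // Minimal stand-in for a SourceSplit: one bounded time range.
    public static class TimeRangeSplit {
        public final String splitId;
        public final long startTime;
        public final long endTime;

        public TimeRangeSplit(String splitId, long startTime, long endTime) {
            this.splitId = splitId;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    // lastBoundary: the end of the last split already handed out.
    // latestDataTime: the newest timestamp observed in IoTDB.
    public static List<TimeRangeSplit> discoverSplits(long lastBoundary,
                                                      long latestDataTime,
                                                      long windowMillis) {
        List<TimeRangeSplit> splits = new ArrayList<>();
        long start = lastBoundary;
        // Only hand out windows that are fully covered by existing data.
        while (start + windowMillis <= latestDataTime) {
            long end = start + windowMillis;
            splits.add(new TimeRangeSplit("split-" + start, start, end));
            start = end;
        }
        return splits;
    }
}
```

In the real connector this loop would live inside the `SplitEnumerator` callback that Flink invokes periodically, and split assignment would go through the enumerator context rather than a returned list.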

Splitting by timestamp boundaries maps naturally onto IoTDB's time-series model 
and enables true parallelism — multiple readers processing different time 
windows simultaneously. Fault tolerance follows directly from this design: 
since each split tracks its own progress independently, the framework 
checkpoints every reader separately and resumes from exactly where it left off 
after a crash — something the older `SourceFunction` approach simply cannot do 
reliably.
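The per-split recovery story can be illustrated with a toy reader whose entire recoverable state is its offset inside the current split. The class below is purely illustrative (no Flink dependencies, and `poll`/`snapshotOffset` are names I invented), but it mirrors the mechanism: a checkpoint persists the offset, and after a crash a new reader is constructed from that offset and continues where the old one stopped.

```java
// Illustrative only: the reader's recoverable state is just the offset
// within its split, which is what makes independent per-reader
// checkpointing and exact resume-after-crash possible.
public class PerSplitProgressSketch {

    private final long[] rows; // stand-in for rows of a bounded per-split query
    private int offset;        // progress inside this split

    public PerSplitProgressSketch(long[] rows, int restoredOffset) {
        this.rows = rows;
        this.offset = restoredOffset; // 0 for a fresh split, >0 after recovery
    }

    // Emit up to n rows; returns how many were actually emitted.
    public int poll(int n) {
        int emitted = 0;
        while (emitted < n && offset < rows.length) {
            offset++; // in a real reader, a record is emitted downstream here
            emitted++;
        }
        return emitted;
    }

    // What a checkpoint would persist for this split.
    public int snapshotOffset() {
        return offset;
    }
}
```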

---

**Current Progress**

I built a preliminary prototype using the older `SourceFunction` API first — 
not as the final approach, but to validate my understanding of IoTDB session 
interactions, table queries, and basic execution flow in a controlled setting: 
https://github.com/A0R0P0I7T/Flink-IoTDB-Table-Connector

I have now begun the FLIP-27 based POC, starting with `IoTDBSplit` implementing 
`SourceSplit` with a unique `splitId` for identification, `startTime`/`endTime` 
boundaries as the unit of parallelism, and a `currentOffset` to track reader 
progress within each split for precise fault-tolerant recovery. 
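A standalone sketch of that split class is below. The real `IoTDBSplit` implements Flink's `SourceSplit` interface; here I drop the Flink dependency and use a placeholder table name in `toBoundedQuery`, which is a helper I added for illustration rather than part of the actual design.

```java
import java.io.Serializable;

// Simplified standalone version of the IoTDBSplit described above; the real
// class implements org.apache.flink.api.connector.source.SourceSplit.
public class IoTDBSplitSketch implements Serializable {

    private final String splitId; // unique id, required by SourceSplit
    private final long startTime; // inclusive lower time bound of the split
    private final long endTime;   // exclusive upper time bound
    private long currentOffset;   // rows already emitted, for recovery

    public IoTDBSplitSketch(String splitId, long startTime, long endTime) {
        this.splitId = splitId;
        this.startTime = startTime;
        this.endTime = endTime;
        this.currentOffset = 0L;
    }

    public String splitId() {
        return splitId;
    }

    // The bounded table-model query a reader would issue for this split
    // (the table name is a placeholder; offset-based resume is applied by
    // the reader while iterating the result set).
    public String toBoundedQuery(String table) {
        return "SELECT * FROM " + table
                + " WHERE time >= " + startTime + " AND time < " + endTime;
    }

    public void advance(long rowsEmitted) {
        currentOffset += rowsEmitted;
    }

    public long currentOffset() {
        return currentOffset;
    }
}
```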

The immediate next steps are building the `SourceReader` with proper `RowData` 
emission and the dynamic `SplitEnumerator` that continuously generates splits 
as new time ranges become available. After that the focus shifts to TAG-based 
filter and projection pushdown, proper schema mapping, and checkpointing 
validation. I will have a working end-to-end POC within the next 24 hours.

---

I am excited about both the database internals and distributed systems aspects 
of this project and welcome any feedback or suggestions from the community.

Thank you for your time.

Arpit Saha

P.S. — I am really enjoying diving deep into the core fundamentals of connector 
architecture through this research — understanding how each piece from split 
design to fault tolerance fits together is proving to be one of the more 
rewarding learning experiences I have had, and it is making me even more 
motivated to build this the right way.


GitHub link: https://github.com/apache/iotdb/discussions/17248
