I have implemented Kinesis Connector for Structured Streaming. The code is
available is at https://github.com/qubole/kinesis-sql.
Open Source Jira for the same is SPARK-18165
Design details are mentioned here - https://docs.google.com/presentation/d/
I wrote and ran unit tests to validate the implementation. Apart from that,
I started 3 node spark clusters to read from Kinesis and write to S3 in
parquet format and ran the query for more than a day.
List of Features implemented
- Re-sharding logic to add/remove new/closed shards
- Executors fetch records from Kinesis as part of the incremental job
execution (instead of having a receiver model where few threads are
responsible for reading from Kinesis)
- Various configuration to have fine-grain control depending upon your
I have developed and tested the connector against Spark 2.2.x and
will migrate it to DataSource V2 APIs in some time.
Can anyone help me in reviewing the design/implementation? I would love it
to be part of Spark Distribution.