Hi all,

I'm looking into implementing a Delta Lake [1] source for Apache Beam.

Some of the highlights are listed below.

* Add support for reading data from an existing Delta Lake table (at HEAD,
which could be past the latest checkpoint).
* Support reading from a specific checkpoint (latest or past).
* Use the new Delta Kernel API to implement the source.
* Support parallelized reading via initial splitting and/or dynamic work
rebalancing.
* Support Beam managed I/O - this will automatically make the connector
available to the Python SDK and will also allow runners to manage the
version of the connector.
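
To illustrate the initial-splitting idea, here is a rough sketch, not taken
from the design doc. It assumes the source can list a snapshot's data files
with their sizes; DataFile, initial_splits and desired_bundle_bytes are
hypothetical names, and the real connector may split quite differently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file listed in a table snapshot."""
    path: str
    size_bytes: int


def initial_splits(files: Iterable[DataFile],
                   desired_bundle_bytes: int) -> List[List[DataFile]]:
    """Greedily group files into bundles of roughly desired_bundle_bytes."""
    if desired_bundle_bytes <= 0:
        raise ValueError("desired_bundle_bytes must be positive")
    bundles: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_bytes = 0
    for f in files:
        # Close the current bundle if adding this file would exceed the target.
        if current and current_bytes + f.size_bytes > desired_bundle_bytes:
            bundles.append(current)
            current, current_bytes = [], 0
        # A file larger than the target still gets a bundle of its own.
        current.append(f)
        current_bytes += f.size_bytes
    if current:
        bundles.append(current)
    return bundles
```

Each bundle could then be read by a separate worker; dynamic work
rebalancing would further subdivide bundles at run time.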

A design doc is available here: https://s.apache.org/beam-delta-lake-source

Please let me know if you have any comments/questions.

Thanks,
Cham

[1] https://delta.io/
