Hi,

I would like to propose a few additions to Spark Structured Streaming support
in Hudi, along with some Spark SQL improvements. These would make my life
easier as a Hudi user, so this is written from the user perspective; I'm not
sure about the implementation side :)

Spark Structured Streaming:
1. As a user, I would like to be able to specify the starting instant
position when reading a Hudi table in a streaming query. This is not possible
in Structured Streaming right now: it starts streaming data from the earliest
available instant, or from the instant saved in the checkpoint.
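To illustrate, something along these lines (the option name below is only a
hypothetical sketch of the proposed API, not an existing Hudi option):

```scala
// Hypothetical sketch of the proposal: the "startOffset" option does not
// exist yet; it illustrates where a starting instant would be supplied.
val df = spark.readStream
  .format("hudi")
  .option("hoodie.datasource.streaming.startOffset", "20220101000000") // proposed: begin from this instant
  .load("/path/to/hudi/table")
```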

2. In Hudi 0.11 it's possible, as far as I know, to fall back to a full table
scan in the absence of commits; this is used in DeltaStreamer. I would like
to have the same functionality in Structured Streaming queries.
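If I recall correctly, the incremental read path already exposes a fallback
flag; the streaming source could honor the same (or a similar) option. A
sketch, assuming the existing incremental-query config name carries over:

```scala
// Sketch: reuse the incremental-query fallback flag on the streaming path.
// Whether the streaming source would use this exact key is an open question.
val df = spark.readStream
  .format("hudi")
  .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true")
  .load("/path/to/hudi/table")
```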

3. I would like to be able to limit the input rate when reading a stream from
a Hudi table. I'm thinking about adding maxInstantsPerTrigger/
maxBytesPerTrigger, e.g. so I could consume at most 100 instants per trigger
in my micro-batch.
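By analogy with Delta Lake's maxFilesPerTrigger, usage could look like this
(both option names are part of the proposal and don't exist yet):

```scala
// Proposed rate-limit options (hypothetical, not an existing Hudi API):
val df = spark.readStream
  .format("hudi")
  .option("maxInstantsPerTrigger", "100") // cap on instants consumed per micro-batch
  .option("maxBytesPerTrigger", "1g")     // soft cap on bytes read per micro-batch
  .load("/path/to/hudi/table")
```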

Spark SQL:
Since 0.11 we have very flexible schema evolution. Could we, as users,
therefore get automatic schema evolution on MERGE INTO operations?
I guess this should only be supported when we use UPDATE SET * and INSERT *
in the merge operation.
In case of missing columns, the reconcile-schema functionality could be used.
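As a concrete example (a sketch; the table and column names are made up):
if the source has an extra column not present in the target, a wildcard merge
would evolve the target schema automatically instead of failing.

```scala
// Sketch of the proposed behavior: `source_updates` carries a column `new_col`
// that `hudi_target` lacks; with automatic evolution, UPDATE SET * / INSERT *
// would add `new_col` to the target schema before applying the merge.
spark.sql("""
  MERGE INTO hudi_target t
  USING source_updates s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```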


Best Regards,
Daniel Kaźmirski
