Thanks for the thoughtful note, Daniel! All of 1-3 look good to me. Yann/Raymond or other Spark regulars here, any thoughts on adding these for 0.12?
For 0.12 we want to get schema evolution to GA. That's also a very useful suggestion. Tao (author of the schema evolution work), any thoughts?

On Mon, Apr 25, 2022 at 4:39 PM Daniel Kaźmirski <d.kazmir...@gmail.com> wrote:
> Hi,
>
> I would like to propose a few additions to Spark Structured Streaming in
> Hudi, plus a Spark SQL improvement. These would make my life easier as a
> Hudi user, so this is from a user's perspective; I'm not sure about the
> implementation side :)
>
> Spark Structured Streaming:
> 1. As a user, I would like to be able to specify the starting instant
> position when reading a Hudi table in a streaming query. This is not
> possible in Structured Streaming right now; it starts streaming data from
> the earliest available instant or from the instant saved in the checkpoint.
>
> 2. In Hudi 0.11 it's possible to fall back to a full table scan in the
> absence of commits (AFAIK this is used in DeltaStreamer). I would like to
> have the same functionality in a Structured Streaming query.
>
> 3. I would like to be able to limit the input rate when reading a stream
> from a Hudi table. I'm thinking about adding maxInstantsPerTrigger/
> maxBytesPerTrigger, e.g. I would like to have 100 instants per trigger in
> my micro-batch.
>
> Spark SQL:
> Since 0.11 we have very flexible schema evolution. Given that, can we as
> users automatically evolve the schema on MERGE INTO operations?
> I guess this should only be supported when we use UPDATE SET * and
> INSERT * in the merge operation.
> In case of missing columns, the reconcile-schema functionality can be used.
>
> Best Regards,
> Daniel Kaźmirski
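To make the streaming asks above (points 1-3) a bit more concrete, here is a rough user-facing sketch in Scala. The option names for the starting instant and the full-table-scan fallback are hypothetical placeholders, not existing Hudi configs; maxInstantsPerTrigger and maxBytesPerTrigger are the names proposed above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hudi-streaming-read-sketch")
      .getOrCreate()

    val stream = spark.readStream
      .format("hudi")
      // (1) start the stream from an explicit instant instead of the earliest available one
      .option("hoodie.datasource.streaming.start.instanttime", "20220425163900")   // hypothetical option name
      // (2) fall back to a full table scan when the requested instants are no longer available
      .option("hoodie.datasource.streaming.fallback.fulltablescan.enable", "true") // hypothetical option name
      // (3) rate-limit each micro-batch, as proposed above
      .option("maxInstantsPerTrigger", "100")
      .option("maxBytesPerTrigger", "1073741824")
      .load("s3://bucket/path/to/hudi_table")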
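And a sketch of the Spark SQL case the MERGE INTO proposal targets: UPDATE SET * / INSERT * where the source carries columns the target does not yet have. Table and column names are illustrative, and the automatic schema evolution here is the proposed behaviour, not something 0.11 does today.

    // hudi_target and source_updates are illustrative table names
    spark.sql("""
      MERGE INTO hudi_target t
      USING source_updates s
      ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)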