aokolnychyi commented on issue #179: Use Iceberg tables as sources for Spark Structured Streaming URL: https://github.com/apache/incubator-iceberg/issues/179#issuecomment-576574659

@jerryshao, sorry for the delay. I am still not up to speed after the holidays.

I've started thinking about this, and it would be great to leverage #315. However, the last time I checked that PR (a long time ago), I didn't see a way to split a large batch commit into multiple streaming micro-batches on the output side. For example, if a batch job inserts a couple of TBs, you probably don't want to stream all of that out of your table in a single micro-batch.

That got me thinking about keeping some sort of index (e.g. manifest name + ordinal position of the last processed file in that manifest) to split large batches. The same logic could potentially be used if we want to stream all of the data from a batch table using Structured Streaming.

I haven't thought much more about this idea since then. Maybe there is a better way, or maybe we can start without it. @jerryshao, I'll be glad if you pick this up. Did you have any specific ideas?
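To make the idea concrete, here is a minimal sketch of what such an index could look like. This is not Iceberg or Spark API code — `StreamingOffset`, `Batch`, and `planNextBatch` are hypothetical names, and manifests are simplified to lists of data-file sizes. The point is only to show how a (manifest, ordinal) offset lets a reader stop mid-commit once a byte budget is reached and resume from the same position in the next micro-batch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one large batch commit into several bounded
// micro-batches by remembering how far into the commit's manifests we read.
class MicroBatchSplitter {

    // Offset = index of the manifest we are in, plus the ordinal of the last
    // file already processed in that manifest (-1 means nothing processed yet).
    static final class StreamingOffset {
        final int manifestIndex;
        final int lastFileOrdinal;
        StreamingOffset(int manifestIndex, int lastFileOrdinal) {
            this.manifestIndex = manifestIndex;
            this.lastFileOrdinal = lastFileOrdinal;
        }
    }

    // One planned micro-batch: the file sizes it covers and the offset at
    // which the next micro-batch should resume.
    static final class Batch {
        final List<Long> files;
        final StreamingOffset end;
        Batch(List<Long> files, StreamingOffset end) {
            this.files = files;
            this.end = end;
        }
    }

    // Walk the manifests starting just after `start`, adding files until the
    // byte budget would be exceeded. A single file larger than maxBytes still
    // forms a batch on its own, so progress is always made.
    static Batch planNextBatch(List<List<Long>> manifests,
                               StreamingOffset start,
                               long maxBytes) {
        List<Long> files = new ArrayList<>();
        long used = 0;
        int m = start.manifestIndex;
        int f = start.lastFileOrdinal + 1;
        StreamingOffset end = start;
        while (m < manifests.size()) {
            List<Long> manifest = manifests.get(m);
            while (f < manifest.size()) {
                long size = manifest.get(f);
                if (!files.isEmpty() && used + size > maxBytes) {
                    return new Batch(files, end); // budget reached; resume here later
                }
                files.add(size);
                used += size;
                end = new StreamingOffset(m, f); // remember last consumed position
                f++;
            }
            m++;       // move to the next manifest
            f = 0;
        }
        return new Batch(files, end);
    }
}
```

Because the offset is just a pair of ordinals, it would serialize naturally into Structured Streaming's checkpointed offsets, so a restarted query resumes exactly where the previous micro-batch stopped, even in the middle of one large commit.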
