aokolnychyi commented on issue #179: Use Iceberg tables as sources for Spark Structured Streaming URL: https://github.com/apache/incubator-iceberg/issues/179#issuecomment-576574659

@jerryshao, sorry for the delay. I am still not up to speed after the holidays.

I've started thinking about this, and it would be great to leverage #315. However, the last time I checked that PR (a long time ago), I didn't see a way to split a large batch commit into multiple streaming micro-batches on the output side. For example, if a batch job inserts a couple of TBs, you probably don't want to stream all of that out of your table in a single micro-batch.

That got me thinking about keeping some sort of index (e.g. manifest name + ordinal position of the last processed file in that manifest) to split large batches. The same logic could potentially be used if we want to stream all of the data from a batch table using Structured Streaming.

I haven't thought much more about this idea since then. Maybe there is a better way, or maybe we can start without it. @jerryshao, I'll be glad if you pick this up. Did you have any specific ideas?
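To make the idea concrete, here is a minimal sketch of what such an index could look like. This is not Iceberg or Spark API code — `StreamingOffset`, `Batch`, and `planNextBatch` are hypothetical names, and manifests are simplified to lists of data-file sizes. The point is only to show how a (manifest, ordinal) offset lets a reader stop mid-commit once a byte budget is reached and resume from the same position in the next micro-batch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one large batch commit into several bounded
// micro-batches by remembering how far into the commit's manifests we read.
class MicroBatchSplitter {

    // Offset = index of the manifest we are in, plus the ordinal of the last
    // file already processed in that manifest (-1 means nothing processed yet).
    static final class StreamingOffset {
        final int manifestIndex;
        final int lastFileOrdinal;
        StreamingOffset(int manifestIndex, int lastFileOrdinal) {
            this.manifestIndex = manifestIndex;
            this.lastFileOrdinal = lastFileOrdinal;
        }
    }

    // One planned micro-batch: the file sizes it covers and the offset at
    // which the next micro-batch should resume.
    static final class Batch {
        final List<Long> files;
        final StreamingOffset end;
        Batch(List<Long> files, StreamingOffset end) {
            this.files = files;
            this.end = end;
        }
    }

    // Walk the manifests starting just after `start`, adding files until the
    // byte budget would be exceeded. A single file larger than maxBytes still
    // forms a batch on its own, so progress is always made.
    static Batch planNextBatch(List<List<Long>> manifests,
                               StreamingOffset start,
                               long maxBytes) {
        List<Long> files = new ArrayList<>();
        long used = 0;
        int m = start.manifestIndex;
        int f = start.lastFileOrdinal + 1;
        StreamingOffset end = start;
        while (m < manifests.size()) {
            List<Long> manifest = manifests.get(m);
            while (f < manifest.size()) {
                long size = manifest.get(f);
                if (!files.isEmpty() && used + size > maxBytes) {
                    return new Batch(files, end); // budget reached; resume here later
                }
                files.add(size);
                used += size;
                end = new StreamingOffset(m, f); // remember last consumed position
                f++;
            }
            m++;       // move to the next manifest
            f = 0;
        }
        return new Batch(files, end);
    }
}
```

Because the offset is just a pair of ordinals, it would serialize naturally into Structured Streaming's checkpointed offsets, so a restarted query resumes exactly where the previous micro-batch stopped, even in the middle of one large commit.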
