rdblue commented on issue #179: Use Iceberg tables as sources for Spark 
Structured Streaming
URL: 
https://github.com/apache/incubator-iceberg/issues/179#issuecomment-555696946
 
 
   > we should be able to stream out all currently present data in addition to 
what will arrive later.
   
   Agreed, but I think it would be easier to start from a particular snapshot 
for now, and add the ability to process all existing data as a follow-up.
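   To make "start from a particular snapshot" concrete, here is a minimal sketch (not the Iceberg API; names and the tuple shape of the snapshot log are my own assumptions) of how a streaming source could pick its starting point from the table's snapshot log and then consume every later snapshot in commit order:

```python
# Illustrative sketch, NOT Iceberg's actual API: treat the table's snapshot
# log as a queue and begin streaming from a chosen snapshot, picking up any
# snapshots committed after it.

def snapshots_to_stream(snapshot_log, start_snapshot_id):
    """Return the snapshots at or after start_snapshot_id, in commit order.

    snapshot_log: list of (snapshot_id, commit_timestamp_ms) tuples, as an
    assumed stand-in for the snapshot history kept in table metadata.
    Raises ValueError if the starting snapshot is not in the log (e.g. it
    has already expired).
    """
    ordered = sorted(snapshot_log, key=lambda s: s[1])
    ids = [s[0] for s in ordered]
    start = ids.index(start_snapshot_id)
    return ordered[start:]

log = [(100, 1), (101, 2), (102, 3)]
snapshots_to_stream(log, 101)  # -> [(101, 2), (102, 3)]
```

   Processing all existing data would then be the follow-up case where the starting point falls before the oldest retained snapshot.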
   
   > we need to take into account that files in the current snapshot could be 
added by already expired snapshots and we might not have metadata for those 
snapshots.
   
   Yes. To start with, I'd recommend ordering partition tuples and processing a 
partition at a time. Partitions are often time-based and correlated with the 
write pattern, so this approach will probably provide a smooth transition 
from partitions to snapshots.
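   The partition-at-a-time idea can be sketched like this (a minimal illustration, assuming data files are represented as dicts with a `partition` tuple and a `path`; not Iceberg's actual data-file model):

```python
from itertools import groupby

def partition_batches(data_files):
    """Group a snapshot's data files by partition tuple and yield one
    batch per partition, in sorted partition order.

    Sorting by partition tuple gives a deterministic processing order even
    when the files were added by snapshots that have since expired.
    """
    by_partition = sorted(data_files, key=lambda f: f["partition"])
    for partition, files in groupby(by_partition, key=lambda f: f["partition"]):
        yield partition, [f["path"] for f in files]

files = [
    {"path": "b.parquet", "partition": ("2019-11-02",)},
    {"path": "a.parquet", "partition": ("2019-11-01",)},
    {"path": "c.parquet", "partition": ("2019-11-01",)},
]
list(partition_batches(files))
# -> [(("2019-11-01",), ["a.parquet", "c.parquet"]),
#     (("2019-11-02",), ["b.parquet"])]
```

   Once the backlog of partitions is drained, the source would switch over to consuming new snapshots incrementally.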
   
   > We can have a config per operation or a list of allowed operations.
   
   This proposal sounds good to me. For overwrite, it seems like we will need a 
delta format to express what happened to a row as well as to pass the row 
contents.
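   As a rough sketch of what such a delta format might carry (a hypothetical shape, not a proposed spec: the operation tags and the function name are my own), an overwrite could be emitted as tagged rows so a downstream consumer can tell deleted rows from newly written ones:

```python
# Hypothetical delta records for streaming an overwrite: each output row
# carries an operation tag plus the row contents.

def overwrite_as_deltas(replaced_rows, new_rows):
    """Express an overwrite as a stream of (operation, row) deltas."""
    deltas = [("delete", row) for row in replaced_rows]
    deltas += [("insert", row) for row in new_rows]
    return deltas

overwrite_as_deltas([{"id": 1, "v": "old"}], [{"id": 1, "v": "new"}])
# -> [("delete", {"id": 1, "v": "old"}), ("insert", {"id": 1, "v": "new"})]
```

   Appends would then be the degenerate case with only insert deltas, which is why allowing them by default and gating other operations behind config seems workable.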
