pvary commented on pull request #2217:
URL: https://github.com/apache/iceberg/pull/2217#issuecomment-780517320


   > We could make a tool for this migration work, just like snapshot 
expiration: Iceberg provides a Java API and a Spark action. For large Hive 
tables, doing it only through the Java API may be very slow, and using an 
engine (Spark or Flink) can speed up the migration. If we only migrate a 
small Hive table, we can run the Flink program on our own machine, just like 
a test case. What do you think?
   
   Sounds good. My only concern is the dependency on 
`StreamExecutionEnvironment` and other Flink constructs. That would pull in a 
Flink dependency even when we do not use Flink for a minimal local migration. 
Is there a way to separate the logic from the parallel execution?
   
   At roughly what number of partitions/files does parallelization become 
necessary? (Just out of curiosity; you might have some numbers.)
   
   Thanks,
   Peter

