pvary commented on pull request #2217: URL: https://github.com/apache/iceberg/pull/2217#issuecomment-780517320
> We can make a tool to do this migration work. Just like snapshot expiration, Iceberg provides a Java API and a Spark action; but for some large Hive tables, doing the migration only through the Java API may be very slow, and using an engine (Spark or Flink) can speed it up. If we only migrate a small Hive table, we can run the Flink program on our own machine, just like a test case. What do you think?

Sounds good. My only concern is the dependency on `StreamExecutionEnvironment` and other Flink constructs. That would introduce a Flink dependency even if we do not use Flink for a minimal local migration. Is there a way to separate the logic from the parallel execution?

At what number of partitions/files does parallelization become necessary? (Just out of curiosity, you might have some numbers.)

Thanks,
Peter
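To illustrate the kind of separation being asked about, here is a minimal sketch (not the PR's actual code): the per-partition migration work is expressed as an engine-agnostic function, a plain thread pool covers the small-table case, and an engine-specific module would wrap the same function for large tables. The types `HivePartition` and `MigratedFiles` and the method names are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MigrationSketch {

  // Hypothetical placeholders standing in for a Hive partition descriptor
  // and the Iceberg data file metadata produced for it.
  interface HivePartition {}
  interface MigratedFiles {}

  // Core migration logic: pure and engine-agnostic, no Flink or Spark imports here.
  static MigratedFiles migratePartition(HivePartition partition) {
    // List the partition's files, build the Iceberg metadata for them, return it.
    throw new UnsupportedOperationException("sketch only");
  }

  // Small tables: run the same logic with a local thread pool, so the minimal
  // path carries no Flink dependency at all.
  static List<MigratedFiles> migrateLocally(List<HivePartition> partitions, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<MigratedFiles>> futures = new ArrayList<>();
      for (HivePartition p : partitions) {
        futures.add(pool.submit(() -> migratePartition(p)));
      }
      List<MigratedFiles> results = new ArrayList<>();
      for (Future<MigratedFiles> f : futures) {
        results.add(f.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  // Large tables: an engine-specific module (e.g. one living next to iceberg-flink)
  // would distribute calls to migratePartition across a Flink job; only that module
  // would depend on StreamExecutionEnvironment.
}
```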
