ziudu commented on issue #3344: URL: https://github.com/apache/hudi/issues/3344#issuecomment-889636377
We made a few POCs and found:

1. The Java client does not support MOR either, so we won't use it.
2. `MultiTableDeltaStreamer` did not work correctly in continuous mode. It did work, more or less, for MOR in single-run mode, but we prefer not to rely on it since the docs say MOR is not supported. In addition, `MultiTableDeltaStreamer` ingests tables serially, not in parallel.

For the moment we will stick with delta streamers, launching them regularly (every 5-10 minutes) to process change data from Debezium or GoldenGate. We'll see how that holds up at 1000 tables. However, we don't think it's well optimized:

- Each delta streamer needs at least one Spark executor, usually with 2 GB of memory, yet most of our tables see only a very small amount of change data (<1 MB) in a 5-10 minute window. We would need a large Hadoop cluster with enough memory for ingestion, transformation, and PrestoSQL.
- We run Spark on YARN, so it takes about 10 seconds just to create the YARN application for each delta streamer run, which is not well optimized either.

Our final thought: is it possible to write a single long-running Spark application that listens to multiple change-data topics and writes the changes in parallel to Hudi tables on Hadoop via PySpark DataFrames?
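On that last question, here is a minimal sketch of what such a long-running job might look like, assuming Spark Structured Streaming with one Kafka source per topic and `foreachBatch` upserts into Hudi. The topic names, broker address, base path, and the record key / precombine columns (`id`, `ts`) are all illustrative assumptions, and the Debezium/GoldenGate payload parsing is elided:

```python
# Sketch only: a single long-running Spark job that subscribes to many CDC
# topics and upserts each one into its own MOR Hudi table. All names/paths
# below are illustrative assumptions, not a tested configuration.

BASE_PATH = "hdfs:///warehouse/hudi"  # assumed base path for Hudi tables


def table_path(topic: str, base_path: str = BASE_PATH) -> str:
    """Map a CDC topic name like 'cdc.orders' to its Hudi table path."""
    table = topic.rsplit(".", 1)[-1]
    return f"{base_path}/{table}"


def write_topic_batch(batch_df, batch_id, topic):
    """Upsert one micro-batch of change records into the topic's Hudi table."""
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", topic.rsplit(".", 1)[-1])
        .option("hoodie.datasource.write.recordkey.field", "id")   # assumed key column
        .option("hoodie.datasource.write.precombine.field", "ts")  # assumed ordering column
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save(table_path(topic)))


def main():
    # Imported here so the pure-Python helpers above don't require Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-topic-hudi-ingest").getOrCreate()
    topics = ["cdc.orders", "cdc.customers"]  # illustrative topic list

    for topic in topics:
        (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            # ... parse the Debezium/GoldenGate JSON into columns here ...
            .writeStream
            .foreachBatch(lambda df, bid, t=topic: write_topic_batch(df, bid, t))
            .option("checkpointLocation", f"{table_path(topic)}/_checkpoints")
            .trigger(processingTime="5 minutes")
            .start())

    # All streams share one executor pool, so small tables no longer each
    # pay for a dedicated executor or a fresh YARN application.
    spark.streams.awaitAnyTermination()


if __name__ == "__main__":
    main()
```

Since every stream runs inside one SparkSession, dynamic allocation could size the executor pool to the aggregate load instead of reserving ~2 GB per table, and the ~10 s YARN startup cost is paid once rather than per table per cycle.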
