ziudu commented on issue #3344: URL: https://github.com/apache/hudi/issues/3344#issuecomment-889636377
We made a few POCs and found:

1. The Java client does not support MOR either, so we won't use it.
2. `MultiTableDeltaStreamer` did not work correctly in continuous mode. It did work, more or less, for MOR in single-run mode, but we prefer not to rely on it since the docs say MOR is not supported. In addition, `MultiTableDeltaStreamer` ingests tables serially, not in parallel.

For the moment we will stick with delta streamers, launching them regularly (every 5-10 minutes) to process change data from Debezium or GoldenGate. We'll see how that holds up at 1000 tables. However, we don't think it's well optimized:

- Each delta streamer needs at least one Spark executor, usually with 2 GB of memory, yet most of our tables see only a very small amount of change data (<1 MB) in a 5-10 minute window. We would need a large Hadoop cluster with enough memory for ingestion, transformation, and PrestoSQL.
- We run Spark on YARN, so it takes about 10 seconds just to create the YARN application for each delta streamer run, which is not well optimized either.

Our final thought: is it possible to write a single long-running Spark application that listens to multiple change-data topics and writes the changes in parallel to Hudi tables on Hadoop via PySpark DataFrames?
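On that last question, here is a minimal sketch of what such a long-running job might look like, assuming Spark Structured Streaming with one Kafka source per topic and `foreachBatch` upserts into Hudi. The topic names, broker address, base path, and the record key / precombine columns (`id`, `ts`) are all illustrative assumptions, and the Debezium/GoldenGate payload parsing is elided:

```python
# Sketch only: a single long-running Spark job that subscribes to many CDC
# topics and upserts each one into its own MOR Hudi table. All names/paths
# below are illustrative assumptions, not a tested configuration.

BASE_PATH = "hdfs:///warehouse/hudi"  # assumed base path for Hudi tables


def table_path(topic: str, base_path: str = BASE_PATH) -> str:
    """Map a CDC topic name like 'cdc.orders' to its Hudi table path."""
    table = topic.rsplit(".", 1)[-1]
    return f"{base_path}/{table}"


def write_topic_batch(batch_df, batch_id, topic):
    """Upsert one micro-batch of change records into the topic's Hudi table."""
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", topic.rsplit(".", 1)[-1])
        .option("hoodie.datasource.write.recordkey.field", "id")   # assumed key column
        .option("hoodie.datasource.write.precombine.field", "ts")  # assumed ordering column
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save(table_path(topic)))


def main():
    # Imported here so the pure-Python helpers above don't require Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-topic-hudi-ingest").getOrCreate()
    topics = ["cdc.orders", "cdc.customers"]  # illustrative topic list

    for topic in topics:
        (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            # ... parse the Debezium/GoldenGate JSON into columns here ...
            .writeStream
            .foreachBatch(lambda df, bid, t=topic: write_topic_batch(df, bid, t))
            .option("checkpointLocation", f"{table_path(topic)}/_checkpoints")
            .trigger(processingTime="5 minutes")
            .start())

    # All streams share one executor pool, so small tables no longer each
    # pay for a dedicated executor or a fresh YARN application.
    spark.streams.awaitAnyTermination()


if __name__ == "__main__":
    main()
```

Since every stream runs inside one SparkSession, dynamic allocation could size the executor pool to the aggregate load instead of reserving ~2 GB per table, and the ~10 s YARN startup cost is paid once rather than per table per cycle.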
