ziudu commented on issue #3344:
URL: https://github.com/apache/hudi/issues/3344#issuecomment-926745203


   Sorry for the late reply.
   
   We used a 4-node Hadoop cluster for testing; each node is an ESXi virtual 
machine ([email protected], 32GB RAM, HDD virtual disk).
   
   We tested several approaches for the initial load:
   
   1) The fastest way is to load the data in Hive format and then convert it 
to Hudi format. We used DataX to extract data from the DB and load it into 
Hadoop at 80K records per second; the conversion is slower, around 30K records 
per second. But this approach is very easy to parallelize and manage: with 2 
DataX instances the load speed reached 150K records per second (linear 
scaling!). I think the speed is limited only by the hardware configuration. 
Since the result was already satisfactory, we didn't test further.
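
   The convert-after-load step above can be sketched with Spark. This is a 
minimal sketch, assuming a Spark session with the Hudi bundle on the 
classpath; the staging table name, key/precombine fields, and target path are 
hypothetical placeholders, not the actual schema used in the tests:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch: convert a Hive/Parquet staging table (loaded by DataX)
// into a Hudi table in one shot. All names below are placeholders.
object HiveToHudi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-hudi-conversion")
      .enableHiveSupport()
      .getOrCreate()

    // Read the staging table that DataX loaded in plain Hive format.
    val staged = spark.table("staging_db.orders")

    staged.write.format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      // bulk_insert skips the index lookup that upsert pays for,
      // which is what makes a one-shot conversion fast.
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode(SaveMode.Overwrite)
      .save("hdfs:///data/hudi/orders")
  }
}
```

   Because each table converts independently, running several such jobs in 
parallel is what makes this approach easy to scale out.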
   
   2) We tested streaming upsert with a Scala application; the speed is 3K 
records per second. This is more than enough for a microservice application's 
continuous ingestion process.
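
   The continuous-ingestion path can be sketched as a Spark Structured 
Streaming job that upserts each micro-batch into Hudi. This is an 
illustrative sketch only; the broker address, topic, table name, and fields 
are assumptions, and the Kafka value decoding (e.g. of Debezium JSON) is 
elided:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Sketch of a streaming upsert into Hudi from a CDC topic.
// Broker, topic, path, and field names are placeholders.
object StreamingUpsert {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-upsert").getOrCreate()

    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "cdc.orders")
      .load()

    stream.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // In a real job the Kafka value bytes would be decoded into
        // columns first; omitted here for brevity.
        batch.write.format("hudi")
          .option("hoodie.table.name", "orders")
          .option("hoodie.datasource.write.recordkey.field", "order_id")
          .option("hoodie.datasource.write.precombine.field", "updated_at")
          // upsert pays an index lookup per record, hence the lower
          // throughput compared with the bulk path above.
          .option("hoodie.datasource.write.operation", "upsert")
          .mode(SaveMode.Append)
          .save("hdfs:///data/hudi/orders")
      }
      .start()
      .awaitTermination()
  }
}
```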
   
   3) We tested bulk insert, but it was even slower than upsert (1.5K records 
per second).
   
   So what we are doing now is:
   A. Write an application that scans database metadata and stores it in 
LinkedIn DataHub.
   B. Write an application that generates the various configuration files for 
DataX, Kafka, Debezium, etc. from LinkedIn DataHub and automates the initial 
load and continuous ingestion process.
   C. Write a Scala application that can subscribe to a range of topics and 
ingest the data.
   D. We've chosen method 1 for the initial load for the moment. It is not 
elegant, but it is fast.
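
   For step C, Spark's Kafka source can subscribe to a whole family of topics 
at once via a regex. A minimal sketch, assuming Debezium-style topic names 
such as "cdc.<db>.<table>"; the pattern and broker are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of subscribing to a range of CDC topics with one stream.
// Broker address and topic pattern are placeholders.
val spark = SparkSession.builder().appName("multi-topic-ingest").getOrCreate()

val stream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  // One stream covers every matching topic; the "topic" column in the
  // resulting DataFrame identifies which table each record belongs to.
  .option("subscribePattern", "cdc\\..*")
  .load()
```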
   
   We haven't tested the Scala application's performance when a single DB 
contains many tables (e.g. > 1000).
   We haven't tested DeltaStreamer in Hudi 0.9 yet.
   
   

