[GitHub] [hudi] spyzzz commented on issue #2175: [SUPPORT] HUDI MOR/COW tuning with spark structured streaming

GitBox Fri, 16 Oct 2020 00:27:24 -0700


spyzzz commented on issue #2175:
URL: https://github.com/apache/hudi/issues/2175#issuecomment-709876261



   After some deeper digging I finally found something. I first tried a plain 
read and write without any transformation, and it was much faster (around 500K 
messages in 30s). I then went step by step to find the bottleneck, and it 
turned out to be my Avro deserialisation: 
   
   ```
   xxx.readStream.selectExpr("deserialize(value) as message")
   ```
   
   So I managed to find a better way to deserialise Avro messages with the 
io.confluent schema registry: 
   
   ```
   xxx.readStream.select(from_avro(col("value"), schema))
   ```
   
   And now I can read and write 500K messages to HUDI in 1.5 min. That's way 
better ... 
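
For readers trying the same `from_avro` switch: the snippet above does not show how `schema` is obtained or how the raw Kafka `value` bytes are framed. If the producer uses the Confluent serializer, each value normally starts with a 5-byte header (one magic byte `0` followed by a 4-byte big-endian schema ID) ahead of the Avro body, and the built-in open-source Spark `from_avro` expects only the Avro body. The sketch below is plain Python, not Spark code and not necessarily what was done in this issue; `split_confluent_frame` is a hypothetical helper name used only to illustrate the framing.

```python
import struct

MAGIC_BYTE = 0   # first byte of a Confluent-framed message
HEADER_LEN = 5   # 1 magic byte + 4-byte big-endian schema ID


def split_confluent_frame(value: bytes):
    """Split a Confluent-framed Kafka value into (schema_id, avro_payload).

    Hypothetical helper for illustration; assumes the Confluent wire format.
    """
    if len(value) < HEADER_LEN:
        raise ValueError("message too short for the Confluent wire format")
    # ">" = big-endian without padding, "b" = 1 signed byte, "I" = 4-byte unsigned int
    magic, schema_id = struct.unpack(">bI", value[:HEADER_LEN])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, value[HEADER_LEN:]


# Example frame: magic byte, schema ID 42, then the (fake) Avro body
frame = bytes([0]) + (42).to_bytes(4, "big") + b"avro-body"
schema_id, payload = split_confluent_frame(frame)
```

In a streaming job the same header removal is commonly expressed as a column expression such as `substring(value, 6)` applied before `from_avro`, assuming every message on the topic was written with the same schema.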


