zhangyue19921010 commented on PR #6843: URL: https://github.com/apache/hudi/pull/6843#issuecomment-1315140013
Hey @alexeykudinkin! Thanks for your response!

> Yeah, this PR was put up purely for experimental purposes even before we finalized landing the previous Disruptor PR (which considerably improved the API!), so I was primarily looking into putting up a quick solution I could benchmark against and validate.
>
> In terms of taking this forward, I have the following thoughts:
>
> * We should try to preserve the existing queue implementations, so that users can experiment in their environment and always pick what works best for them.
> * This variability of course should not come at the expense of performance.
> * We should maybe take another holistic look at the executor APIs and try to see if we can simplify them even further (by, say, eliminating what we don't really need).
>
> Our goal would be to fit such a `SimpleExecutor` into the existing framework in a way that allows all of the aforementioned points to hold true.

Totally agree with you! Let's make it happen. I will tune the `SimpleExecutor` PR https://github.com/apache/hudi/pull/7174 to simplify the APIs even further and remove what we don't need.

> P.S. What surprised me actually was that I was NOT able to reproduce the 20% increase in performance for Disruptor on our benchmarks that you folks were seeing in your environment. The first issue was due to https://github.com/apache/hudi/pull/7188, but even after addressing it I still don't see an improvement.

Would you mind sharing more information about your test? For example, the number of records, CPU/memory resources, and maybe the schema if you want.

From our experience, we have two kinds of **spark-streaming** ingestion jobs:

1. For aggregated tables that have no unique Hudi keys, we use insert or bulk_insert + clustering for the Hudi ingestion. **Our performance test is based on this case.**
2. For raw data, we use the upsert operation. This one is still ongoing using Disruptor.
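To make the "users pick what works best for them" point concrete, here is a minimal sketch of how a write job could select one of the coexisting executor implementations via a Hudi write option. This assumes an executor-type config key along the lines of `hoodie.write.executor.type` with values such as `BOUNDED_IN_MEMORY`, `DISRUPTOR`, and `SIMPLE`; the exact key and enum names depend on how the PRs above land, so treat them as placeholders and check the release you run.

```python
# Hedged sketch, not the authoritative config surface: the executor-type key
# and its values are assumptions based on the direction of this discussion.
hudi_options = {
    "hoodie.table.name": "f_order_sa_delivered_hourly_hudi",
    # Case 1 (no unique keys): plain insert, optionally with inline clustering.
    "hoodie.datasource.write.operation": "insert",
    # Keep all queue implementations available and let each deployment choose:
    "hoodie.write.executor.type": "DISRUPTOR",  # or BOUNDED_IN_MEMORY / SIMPLE
}

# These options would then be passed to the Spark datasource writer, e.g.:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.write.executor.type"])
```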
More details on our performance test:

* Hudi version: 0.10.1
* Spark version: 3.0.2 (Spark Streaming)
* Records per batch (max): 754932000
* Schema: 18 columns

```json
{
  "type": "record",
  "name": "f_order_sa_delivered_hourly_hudi",
  "namespace": "tv.freewheel.schemas",
  "fields": [
    {"name": "timestamp", "type": "long", "doc": "category=timestamp"},
    {"name": "network_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxxx_visibility", "type": ["null", "string"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_owner_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "sxxx_endpoint_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxxx_order_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_source", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_type", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "content_xxx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_publisher_id", "type": ["null", "string"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_order_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_ad_views", "type": ["null", "long"], "default": null, "doc": "category=metric"},
    {"name": "revenue", "type": ["null", "double"], "default": null, "doc": "category=metric"}
  ]
}
```

**Insert/bulk_insert**
Performance benchmark between BIMQ (baseline) and Disruptor with the same Kafka input, resources, and configs:

* **BIMQ: took 7.9 min to finish writing the parquet files.**
* **Disruptor: took 5.6 min to finish writing the parquet files.**

In terms of **Case 1** write performance, Disruptor **improved by about 29%**, from 7.9 min to 5.6 min.
