zhangyue19921010 commented on PR #6843: URL: https://github.com/apache/hudi/pull/6843#issuecomment-1315140013
Hey @alexeykudinkin! Thanks for your response!

> Yeah, this PR was put up purely for experimental purposes even before we finalized landing the previous Disruptor PR (which considerably improved the API!), so I was primarily looking into putting up a quick solution I could benchmark against and validate.
>
> In terms of taking this forward, I have the following thoughts:
>
> * We should try to preserve the existing queue implementations, so that users can experiment in their environment and always pick what works best for them.
> * This variability of course should not come at the expense of performance.
> * We should maybe take another holistic look at the executor APIs and try to see if we can simplify them even further (by, say, eliminating what we don't really need).
>
> Our goal would be to fit such a `SimpleExecutor` into the existing framework in a way that allows all of the aforementioned points to hold true.

Totally agree with you! Let's make it happen. I will tune the `SimpleExecutor` PR https://github.com/apache/hudi/pull/7174 to simplify the APIs even further and remove what we don't need.

> P.S. What surprised me actually was that I was NOT able to reproduce the 20% increase in performance for Disruptor on our benchmarks that you folks were seeing in your environment. The first issue was due to https://github.com/apache/hudi/pull/7188, but even after addressing it I still don't see an improvement.

Would you mind sharing more information about your test? For example, the number of records, CPU/memory resources, and maybe the schema if you want.

From our experience, we have two kinds of **spark-streaming** ingestion jobs:

1. For aggregated tables that have no unique Hudi keys, we use insert or bulk_insert + clustering for the Hudi ingestion. **Our performance test is based on this case.**
2. For raw data, we use the upsert operation. This one is still ongoing using Disruptor.
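To make the "users pick what works best for them" point concrete, here is a minimal sketch of how a write job could select one of the coexisting executor implementations via a Hudi write option. This assumes an executor-type config key along the lines of `hoodie.write.executor.type` with values such as `BOUNDED_IN_MEMORY`, `DISRUPTOR`, and `SIMPLE`; the exact key and enum names depend on how the PRs above land, so treat them as placeholders and check the release you run.

```python
# Hedged sketch, not the authoritative config surface: the executor-type key
# and its values are assumptions based on the direction of this discussion.
hudi_options = {
    "hoodie.table.name": "f_order_sa_delivered_hourly_hudi",
    # Case 1 (no unique keys): plain insert, optionally with inline clustering.
    "hoodie.datasource.write.operation": "insert",
    # Keep all queue implementations available and let each deployment choose:
    "hoodie.write.executor.type": "DISRUPTOR",  # or BOUNDED_IN_MEMORY / SIMPLE
}

# These options would then be passed to the Spark datasource writer, e.g.:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.write.executor.type"])
```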
More details on our performance test:

* Hudi version: 0.10.1
* Spark version: 3.0.2 (Spark Streaming)
* Records per batch (max): 754932000
* Schema: 18 columns

```json
{
  "type": "record",
  "name": "f_order_sa_delivered_hourly_hudi",
  "namespace": "tv.freewheel.schemas",
  "fields": [
    {"name": "timestamp", "type": "long", "doc": "category=timestamp"},
    {"name": "network_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxxx_visibility", "type": ["null", "string"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_owner_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "sxxx_endpoint_id", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxxx_order_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_source", "type": ["null", "int"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_type", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "content_xxx_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_publisher_id", "type": ["null", "string"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_order_id", "type": ["null", "long"], "default": null, "doc": "category=dimension"},
    {"name": "xxx_ad_views", "type": ["null", "long"], "default": null, "doc": "category=metric"},
    {"name": "revenue", "type": ["null", "double"], "default": null, "doc": "category=metric"}
  ]
}
```

**Insert/bulk_insert**
Performance benchmark between BIMQ (baseline) and Disruptor with the same Kafka input, resources, and configs:

* **BIMQ: took 7.9 min to finish writing the parquet files.**
* **Disruptor: took 5.6 min to finish writing the parquet files.**

In terms of **Case 1** write performance, Disruptor **improved by about 29%**, from 7.9 min to 5.6 min.
