HeartSaVioR edited a comment on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer URL: https://github.com/apache/spark/pull/22138#issuecomment-472339362 After touching some improvements in other area, I'm now back on this issue again. First of all, I found my bad on previous experiment. I just realized my measurement is totally wrong. `addBatch` is for measuring latency in sink. I think I lost context on the patch and I was in a rush at that time, since it has been not easy to get feedback on this. Sorry to all about making confusion. If I'm not missing here, there're no measurement on latency for fetching data. Actually measuring latency in test env. is also not easy because I've observed acquiring consumer as well as fetching records which is already cached in local Kafka took 0 ms in many times. Instead of dealing with nanoseconds, I just changed the approach to insert some debug log messages to see `how many times KafkaSource fetches requests to Kafka to fetch records`. * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging * actual commit: https://github.com/HeartSaVioR/spark/commit/7ceff19caf5e5e6d3d677b40622c362c60bed627 * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging * actual commit: https://github.com/HeartSaVioR/spark/commit/69985d9a50cd86aaa97e5398ba779c73d3aae7ae Topic and data distribution is follow: ``` truck_speed_events_stream_spark_25151_v1:0:99440 truck_speed_events_stream_spark_25151_v1:1:99489 truck_speed_events_stream_spark_25151_v1:2:397759 truck_speed_events_stream_spark_25151_v1:3:198917 truck_speed_events_stream_spark_25151_v1:4:99484 truck_speed_events_stream_spark_25151_v1:5:497320 truck_speed_events_stream_spark_25151_v1:6:99430 truck_speed_events_stream_spark_25151_v1:7:397887 truck_speed_events_stream_spark_25151_v1:8:397813 truck_speed_events_stream_spark_25151_v1:9:0 ``` Please note that we'll only use smallest 4 partitions (0, 1, 4, 6) from these partitions to finish the query earlier. I've also modified test code to apply self-join and also reflect the data distribution. https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2 The query ran 497 batches. I've collected the count of fetch requests on Kafka via below command: ``` grep "DEBUG: fetching data from Kafka consumer" logfile | wc -l ``` and here's the result: * master: 1996 * patch: 1592 In patch's log file I observed it fetches from Kafka per 500 offset, twice (self-join) per partition. In master's log file I observed it fetches from Kafka per 200 offset. Once per partition though. According to the result, I think the patch would also reduce constructing Kafka consumer for this case, but will also experiment it and share the result. cc. @koeninger Could you please revisit the result and see whether the result meets your expectation?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
