[GitHub] [spark] HeartSaVioR edited a comment on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer

GitBox Wed, 13 Mar 2019 02:15:02 -0700

HeartSaVioR edited a comment on issue #22138: [SPARK-25151][SS] Apply Apache 
Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#issuecomment-472339362
 
 
   After working on improvements in other areas, I'm now back on this issue.
   
   First of all, I found a mistake in my previous experiment: my measurement 
was completely wrong. `addBatch` measures latency in the sink. I think I lost 
context on the patch and was in a rush at the time, since it has not been easy 
to get feedback on this. Sorry to everyone for the confusion.
   
   Unless I'm missing something, there is no existing measurement of latency 
for fetching data. Measuring latency in a test environment is also not easy: 
I've observed that both acquiring a consumer and fetching records that are 
already cached in local Kafka took 0 ms in many cases.
   
   Instead of dealing with nanoseconds, I changed the approach and inserted 
some debug log messages to see `how many times KafkaSource sends fetch requests 
to Kafka to fetch records`.
   
   * master: 
https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging
     * actual commit: 
https://github.com/HeartSaVioR/spark/commit/7ceff19caf5e5e6d3d677b40622c362c60bed627
   * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging
     * actual commit: 
https://github.com/HeartSaVioR/spark/commit/69985d9a50cd86aaa97e5398ba779c73d3aae7ae
   
   The topic and data distribution are as follows: 
   
   ```
   truck_speed_events_stream_spark_25151_v1:0:99440
   truck_speed_events_stream_spark_25151_v1:1:99489
   truck_speed_events_stream_spark_25151_v1:2:397759
   truck_speed_events_stream_spark_25151_v1:3:198917
   truck_speed_events_stream_spark_25151_v1:4:99484
   truck_speed_events_stream_spark_25151_v1:5:497320
   truck_speed_events_stream_spark_25151_v1:6:99430
   truck_speed_events_stream_spark_25151_v1:7:397887
   truck_speed_events_stream_spark_25151_v1:8:397813
   truck_speed_events_stream_spark_25151_v1:9:0
   ```
   
   Please note that we'll use only the four smallest non-empty partitions 
(0, 1, 4, 6) so that the query finishes earlier.
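
As a rough illustration of how those four partitions fall out of the listing, here is a minimal Python sketch. It assumes the third field of each entry is the number of records in that partition (it may equally be the end offset), and the helper names `parse_listing` and `smallest_non_empty` are hypothetical, not part of the patch or the test code.

```python
# Entries copied from the listing above: "topic:partition:count".
LISTING = """\
truck_speed_events_stream_spark_25151_v1:0:99440
truck_speed_events_stream_spark_25151_v1:1:99489
truck_speed_events_stream_spark_25151_v1:2:397759
truck_speed_events_stream_spark_25151_v1:3:198917
truck_speed_events_stream_spark_25151_v1:4:99484
truck_speed_events_stream_spark_25151_v1:5:497320
truck_speed_events_stream_spark_25151_v1:6:99430
truck_speed_events_stream_spark_25151_v1:7:397887
truck_speed_events_stream_spark_25151_v1:8:397813
truck_speed_events_stream_spark_25151_v1:9:0
"""


def parse_listing(text):
    """Return {partition: count} from 'topic:partition:count' lines."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a topic name containing ':' would still parse.
        _topic, partition, count = line.rsplit(":", 2)
        counts[int(partition)] = int(count)
    return counts


def smallest_non_empty(counts, n):
    """Return the ids of the n smallest partitions that hold at least one record."""
    non_empty = sorted((c, p) for p, c in counts.items() if c > 0)
    return sorted(p for _c, p in non_empty[:n])


counts = parse_listing(LISTING)
selected = smallest_non_empty(counts, 4)
total = sum(counts[p] for p in selected)
print(selected, total)
```

Partition 9 is empty, so it is skipped even though it has the lowest count.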
   
   I've also modified the test code to apply a self-join and to reflect the 
data distribution.
   https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2
   
   The query ran 497 batches.
   
   I've collected the count of fetch requests to Kafka with the command below:
   
   ```
   grep "DEBUG: fetching data from Kafka consumer" logfile | wc -l
   ```
   
   and here's the result:
   
   * master: 1996
   * patch: 1592
   
   In the patch's log file, I observed that it fetches from Kafka every 500 
offsets, twice (because of the self-join) per partition.
   In master's log file, I observed that it fetches from Kafka every 200 
offsets, though only once per partition.
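
To put the two counts in perspective, here is a small Python sketch that mirrors the `grep ... | wc -l` count and computes the relative difference between the branches. The function names are hypothetical; the numbers are the ones reported above.

```python
MARKER = "DEBUG: fetching data from Kafka consumer"


def count_fetches(lines):
    """Count lines containing the debug marker (same idea as grep MARKER | wc -l)."""
    return sum(1 for line in lines if MARKER in line)


def relative_reduction(baseline, candidate):
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline


# Figures reported in this comment.
master_fetches = 1996
patch_fetches = 1592
batches = 497

saved = master_fetches - patch_fetches
print(f"fewer fetch requests with the patch: {saved}")
print(f"relative reduction: {relative_reduction(master_fetches, patch_fetches):.1%}")
print(f"fetches per batch, master: {master_fetches / batches:.2f}")
print(f"fetches per batch, patch:  {patch_fetches / batches:.2f}")
```

With these figures the patch issues 404 fewer fetch requests over the 497 batches, roughly a 20% reduction. To count from a real log, pass an open file object to `count_fetches`.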
   
   Based on the result, I think the patch would also reduce the number of 
Kafka consumers constructed in this case, but I'll run that experiment as well 
and share the result.
   
   cc. @koeninger Could you please take a look at the result and see whether 
it meets your expectation?


