gaborgsomogyi opened a new pull request #25853: [SPARK-21869][SS] Apply Apache 
Commons Pool to Kafka producer
URL: https://github.com/apache/spark/pull/25853
 
 
   ### What changes were proposed in this pull request?
   
   Kafka producers are currently closed once 
`spark.kafka.producer.cache.timeout` is reached, which can be a significant 
problem when processing large SQL queries. The workaround was to increase 
`spark.kafka.producer.cache.timeout` to a value large enough for the biggest 
SQL query to finish.
   
   In this PR I've adapted a similar solution to the one that already exists 
on the consumer side, namely applying Apache Commons Pool on the producer side 
as well. Main advantages of choosing this solution:
   * Producers are not closed while they're in use
   * No manual reference counting needed (which may be error prone)
   * Thread-safe by design
   * Provides a JMX connection to the pool, from which metrics can be fetched
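
To make the first three points concrete, here is a minimal, language-neutral sketch of the pooling idea, written in Python. It is not the Apache Commons Pool API and not the code in this PR: `KeyedPool`, `FakeProducer`, `borrow`, `give_back` and `evict_idle` are hypothetical names chosen for illustration only. The point it tries to show is that eviction looks only at idle objects, so a borrowed producer cannot be closed by a timeout.

```python
import threading
import time


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer; only tracks close()."""

    def __init__(self, key):
        self.key = key
        self.closed = False

    def close(self):
        self.closed = True


class KeyedPool:
    """Toy keyed pool: idle objects expire, borrowed objects never do."""

    def __init__(self, factory, idle_timeout_s, clock=time.monotonic):
        self._factory = factory            # builds a new object for a key
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock                # injectable, so eviction is testable
        self._lock = threading.Lock()      # guards both dictionaries below
        self._idle = {}                    # key -> list of (obj, returned_at)
        self._borrowed = {}                # id(obj) -> (key, obj)

    def borrow(self, key):
        """Hand out an idle object for the key, or create a new one."""
        with self._lock:
            idle = self._idle.get(key)
            obj = idle.pop()[0] if idle else self._factory(key)
            self._borrowed[id(obj)] = (key, obj)
            return obj

    def give_back(self, obj):
        """Return a borrowed object; its idle timer starts now."""
        with self._lock:
            key, _ = self._borrowed.pop(id(obj))
            self._idle.setdefault(key, []).append((obj, self._clock()))

    def evict_idle(self):
        """Close idle objects past the timeout and return how many were closed."""
        now = self._clock()
        closed = 0
        with self._lock:
            for key, entries in list(self._idle.items()):
                keep = []
                for obj, returned_at in entries:
                    if now - returned_at >= self._idle_timeout_s:
                        obj.close()
                        closed += 1
                    else:
                        keep.append((obj, returned_at))
                if keep:
                    self._idle[key] = keep
                else:
                    del self._idle[key]
        return closed
```

A caller would wrap the work in `try`/`finally` and call `give_back` in the `finally` branch, so the object goes back to the pool even when the work fails. The pool tracks what is borrowed, so callers do no reference counting of their own.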
   
   What this PR contains:
   * Introduced producer-side parameters to configure the pool
   * Renamed `InternalKafkaConsumerPool` to `InternalKafkaConnectorPool` and 
made it abstract
   * Created 2 implementations from it: `InternalKafkaConsumerPool` and 
`InternalKafkaProducerPool`
   * Adapted `CachedKafkaProducer` to use `InternalKafkaProducerPool`
   * Changed `KafkaDataWriter` and `KafkaDataWriteTask` to release the 
producer even in failure scenarios
   * Added several new tests
   * Extended `KafkaTest` to clear not only producers but consumers as well
   * Renamed `InternalKafkaConsumerPoolSuite` to 
`InternalKafkaConnectorPoolSuite` where only consumer tests are checking the 
behavior (please see comment for reasoning)
   
   What this PR does not yet contain (but is intended once the main concept 
is stable):
   * User facing documentation
   
   ### Why are the changes needed?
   Kafka producers are closed after 10 minutes (with default settings).
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Existing + additional unit tests.
   Cluster tests are being started.
   
