[jira] [Created] (SPARK-49069) Spark Structured Streaming unable to send back offset to Kafka consumer group

Siyun Liang (Jira) Tue, 30 Jul 2024 20:31:04 -0700

Siyun Liang created SPARK-49069:
-----------------------------------

             Summary: Spark Structured Streaming unable to send back offset to 
Kafka consumer group
                 Key: SPARK-49069
                 URL: https://issues.apache.org/jira/browse/SPARK-49069
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.5.1
         Environment: Spark Version: 3.5.1


Kafka Version: 0.11+

Spark Kafka package: org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1
            Reporter: Siyun Liang


h3. Summary

Unable to commit Kafka offsets back to Kafka consumer group when using Spark 
Structured Streaming.
h3. Description

I am using Spark Structured Streaming to read data from a Kafka topic, process 
it, and write the results to the console. While the streaming job reads and 
processes data from Kafka correctly, it fails to commit the offsets back to the 
Kafka consumer group. As a result, Kafka does not recognize where the consumer 
left off, causing potential reprocessing of messages.
h3. Configuration

Here is the relevant part of my Spark application configuration:
{code:java}
spark = (
        SparkSession.builder
        .appName("Test-KafkaSparkStructuredStreaming")
        .config("spark.ui.enabled", "true")
        .config("spark.ui.port", test_port)
        .config("spark.jars.packages", 
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "1")
        .config("spark.kafka.consumer.commit.offsets.on.checkpoint", "true")
        .getOrCreate()
    )

df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", kafka_bootstrap_servers)
        .option("subscribe", kafka_topic)
        .option("auto.offset.reset", "earliest")
        .option("group.id", "Test_group")
        .load()
    )

# Spark Streaming Process code ...

query = (
        result_df.writeStream.outputMode("append")
        .format("console")
        .option(
            "checkpointLocation",
            "spark-checkpoint",
        )
        .start()
    )

query.awaitTermination()

spark.stop(){code}
h3. Steps to Reproduce
 # Set up a Kafka topic with data.
 # Configure a Spark Structured Streaming job to read from the Kafka topic 
using the above configuration.
 # Start the streaming job.
 # Observe that data is read and processed correctly, but offsets are not 
committed back to Kafka.

h3. Expected Behavior

Kafka offsets should be committed back to the Kafka consumer group, allowing 
Kafka to track the last consumed offsets.
h3. Actual Behavior

Kafka offsets are not committed back to the Kafka consumer group. As a result, 
Kafka does not recognize where the consumer left off, potentially causing 
reprocessing of messages.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-49069) Spark Structured Streaming unable to send back offset to Kafka consumer group

Reply via email to