On Fri, Nov 1, 2024 at 11:37 AM Martin Andersson <martin.anders...@kambi.com>
wrote:

> Why do you need to change the kafka native timestamp? It is the time the
> message was *produced*, defined by the configuration
> log.message.timestamp.type
> <https://kafka.apache.org/documentation/#brokerconfigs_log.message.timestamp.type>,
> by default the time the record was created client-side.
>

Hi Martin!

Thank you for your feedback! I just checked the producer docs (
https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html#send-org.apache.kafka.clients.producer.ProducerRecord-org.apache.kafka.clients.producer.Callback-
):
> If CreateTime is used by the topic, the timestamp will be the user
> provided timestamp or the record send time if the user did not specify a
> timestamp for the record.

So the producer is supposed to pick up that value when it is specified, and the
proposed feature lets users of the Spark Kafka sink do exactly that.
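
For reference, this is roughly what an explicit record timestamp looks like
with the plain Kafka producer API (topic, key and value below are just
placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // With CreateTime topics the broker keeps this user-supplied timestamp
    // instead of the send time.
    val eventTimeMs = 1730368440000L
    val record = new ProducerRecord[String, String](
      "events",                            // topic (placeholder)
      null,                                // partition: let the partitioner decide
      java.lang.Long.valueOf(eventTimeMs), // explicit record timestamp
      "some-key",
      "some-value")
    producer.send(record)
    producer.close()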

Especially when batch processing, the send time becomes meaningless. We are
more interested in the event time, which semantically is the creation time of
the event represented by the Kafka record.
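
To sketch the intended usage from the Spark side (only a sketch, not
necessarily the exact shape of the PR: column and topic names are
illustrative, and it assumes the sink would read the record timestamp from a
dedicated column next to key and value):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("kafka-event-time").getOrCreate()

    // Batch input that already carries an event time, e.g. columns
    // id, payload, eventTime (TimestampType).
    val events = spark.read.parquet("/data/events")

    events
      .select(
        col("id").cast("string").as("key"),
        col("payload").cast("string").as("value"),
        // Assumption: the sink would use this column as the Kafka record
        // timestamp instead of the send time.
        col("eventTime").as("timestamp"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "events")
      .save()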


>
> If you need to add a timestamp with other semantics then you should add
> another timestamp, either in the payload or as a header.
>

Yes and no. Event-time processing is an intended use case for the record
timestamp; see, for example,
https://narayanb.medium.com/unraveling-kafkas-message-ordering-and-timestamps-navigating-the-world-of-streaming-data-b58b1a4a679e

So we think the proposed feature makes the API less opinionated and more
transparent.

Regards,

>
> Regards,
>
> ------------------------------
> *From:* Peter Fischer <pfisc...@wikimedia.org>
> *Sent:* Thursday, October 31, 2024 00:34
> *To:* dev@spark.apache.org <dev@spark.apache.org>
> *Subject:* [Spark SQL] KafkaWriteTask: allow customising timestamp - PR
>
>
> Hi!
>
> We wrote a wrapper around the kafka writer to add client-side schema
> validation. In the process we noticed that there was no way to change a
> kafka record's timestamp when writing. So we extended spark-sql-kafka to
> support it and would love to hear your feedback.
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-50160
> GitHub: https://github.com/apache/spark/pull/48695
>
-- 
Peter Fischer (he/him)
Senior Software Engineer, Search Platform
Wikimedia Foundation
