rdblue commented on issue #24738: [WIP][SPARK-23098][SQL] Migrate Kafka Batch source to v2. URL: https://github.com/apache/spark/pull/24738#issuecomment-502764757

@HeartSaVioR, thanks for the additional context. I can see what the issue is now. This is definitely a case we need to update the DSv2 design for, but I suggest we think about the problem slightly differently.

For file-based sources, Spark supports a special function, [`input_file_name()`](https://spark.apache.org/docs/2.3.1/api/sql/#input_file_name), that returns the file each row was read from. I think the extra Kafka columns are essentially the same thing: extra metadata about a row. I'd like to solve this by coming up with a good way to expose that optional per-row metadata from a source. That way, the same solution can implement `input_file_name()` for file-based sources, expose the additional Kafka metadata, and also expose additional metadata from Cassandra (@RussellSpitzer has been interested in the same area).

There are two approaches. First, we could use functions that get pushed down to the source, like `input_file_name()`. Second, we could use a [virtual column](https://en.wikipedia.org/wiki/Virtual_column) approach like Presto's, where columns such as `$name` can be requested from a source but are not considered part of the table schema.

What do you think about either of these approaches for exposing the additional partition, offset, and timestamp metadata?
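To make the comparison concrete, here is a rough Scala sketch. The broker address, topic name, file path, and column names are placeholders, and the `$partition`/`$offset`/`$timestamp` virtual columns in the commented-out block are hypothetical, not an existing Spark API. Only the `input_file_name()` call and the current Kafka-source read (which needs the `spark-sql-kafka` package on the classpath) work today.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}

object MetadataColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("metadata-sketch").getOrCreate()

    // Pushed-down function style, as it exists today: input_file_name() returns
    // per-row metadata (the source file) without being part of the table schema.
    val files = spark.read.parquet("/data/events")          // placeholder path
      .select(col("id"), input_file_name().as("source_file"))

    // Current Kafka source: partition/offset/timestamp are ordinary columns of the
    // fixed Kafka schema, alongside key and value.
    val kafka = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")     // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()
      .select(col("key"), col("value"),
        col("partition"), col("offset"), col("timestamp"))

    // Hypothetical Presto-style virtual columns (not an existing Spark API):
    // metadata is requested explicitly and never appears in the table schema
    // or in a SELECT *.
    // val kafkaV2 = spark.read.table("kafka_events")
    //   .select(col("key"), col("value"),
    //     col("$partition"), col("$offset"), col("$timestamp"))
  }
}
```

The trade-off as I see it: pushed-down functions reuse a mechanism Spark already has, while virtual columns keep the metadata out of the table schema entirely, so `SELECT *` and schema-based tooling only ever see the real data columns.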
