rdblue commented on issue #24738: [WIP][SPARK-23098][SQL] Migrate Kafka Batch source to v2. URL: https://github.com/apache/spark/pull/24738#issuecomment-502764757

@HeartSaVioR, thanks for the additional context. I can see what the issue is now. This is definitely a case we need to update the DSv2 design for, but I suggest we think about the problem slightly differently.

For file-based sources, Spark supports a special function, [`input_file_name()`](https://spark.apache.org/docs/2.3.1/api/sql/#input_file_name), that returns the file each row was read from. I think the extra Kafka columns are essentially the same thing: extra metadata about a row. I'd like to solve this by coming up with a good way to expose that optional per-row metadata from a source. That way, the same solution can implement `input_file_name()` for file-based sources, expose the additional Kafka metadata, and also expose additional metadata from Cassandra (@RussellSpitzer has been interested in the same area).

There are two approaches. First, we could use functions that get pushed down to the source, like `input_file_name()`. Second, we could use a [virtual column](https://en.wikipedia.org/wiki/Virtual_column) approach like Presto's, where columns such as `$name` can be requested from a source but are not considered part of the table schema.

What do you think about either of these approaches for exposing the additional partition, offset, and timestamp metadata?
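To make the comparison concrete, here is a rough Scala sketch. The broker address, topic name, file path, and column names are placeholders, and the `$partition`/`$offset`/`$timestamp` virtual columns in the commented-out block are hypothetical, not an existing Spark API. Only the `input_file_name()` call and the current Kafka-source read (which needs the `spark-sql-kafka` package on the classpath) work today.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}

object MetadataColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("metadata-sketch").getOrCreate()

    // Pushed-down function style, as it exists today: input_file_name() returns
    // per-row metadata (the source file) without being part of the table schema.
    val files = spark.read.parquet("/data/events")          // placeholder path
      .select(col("id"), input_file_name().as("source_file"))

    // Current Kafka source: partition/offset/timestamp are ordinary columns of the
    // fixed Kafka schema, alongside key and value.
    val kafka = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")     // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()
      .select(col("key"), col("value"),
        col("partition"), col("offset"), col("timestamp"))

    // Hypothetical Presto-style virtual columns (not an existing Spark API):
    // metadata is requested explicitly and never appears in the table schema
    // or in a SELECT *.
    // val kafkaV2 = spark.read.table("kafka_events")
    //   .select(col("key"), col("value"),
    //     col("$partition"), col("$offset"), col("$timestamp"))
  }
}
```

The trade-off as I see it: pushed-down functions reuse a mechanism Spark already has, while virtual columns keep the metadata out of the table schema entirely, so `SELECT *` and schema-based tooling only ever see the real data columns.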
