Hi,

One of Apache Spark's most useful SQL functions is input_file_name, which provides a simple API for identifying the source of a row of data when it was read from a file-based source such as Parquet or CSV (e.g. SELECT input_file_name(), * FROM some_table). This is particularly useful for identifying which file/partition of a Parquet dataset a row came from, and it is used heavily by the Delta Lake format to determine which files are affected by MERGE operations.
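For context, this is roughly what the feature would enable in DataFusion. The query is hypothetical (input_file_name does not exist in DataFusion today), the table name is made up, and the ExecutionContext calls are written from memory of the current API, so treat this as a sketch rather than working code:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let mut ctx = ExecutionContext::new();
        // Register a directory of Parquet files as a table ("events" is made up).
        ctx.register_parquet("events", "data/events")?;

        // Hypothetical query: tag each row with the path of the file it was
        // read from, as Spark's input_file_name does.
        let df = ctx.sql("SELECT input_file_name(), * FROM events")?;
        let _batches = df.collect().await?;
        Ok(())
    }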
I have built a functional proof of concept for DataFusion, but it requires modifying the RecordBatch struct to include a metadata struct (RecordBatchMetadata) that carries the source file name attached to each batch. It also requires changing the ScalarFunctionImplementation signature (and therefore every scalar function) to expose that metadata:

From:

    Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>

To:

    Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>

These changes have been made in a personal feature branch and are available for review (the branch still needs cleaning up). Conceptually, does anyone have a problem with this API change, or does anyone have a better proposal? A rough sketch of the pieces is included below my signature.

Thanks,
Mike
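For anyone who wants to see the shape of the change without digging through the branch, here is a minimal sketch. RecordBatchMetadata and its filename field are names from my branch rather than a settled API, and the import paths assume the current crate layout:

    use std::sync::Arc;

    use datafusion::error::Result;
    use datafusion::physical_plan::ColumnarValue;
    use datafusion::scalar::ScalarValue;

    /// Sketch: per-batch metadata carried alongside each RecordBatch.
    #[derive(Debug, Clone, Default)]
    pub struct RecordBatchMetadata {
        /// Path of the file the batch was read from; None for
        /// non-file sources such as in-memory tables.
        pub filename: Option<String>,
    }

    // Current signature:
    // pub type ScalarFunctionImplementation =
    //     Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;

    /// Proposed signature: every scalar function also receives the metadata.
    pub type ScalarFunctionImplementation =
        Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;

    /// input_file_name built on the proposed signature: it ignores its (empty)
    /// argument list and returns the batch's source file name as a scalar.
    pub fn input_file_name() -> ScalarFunctionImplementation {
        Arc::new(|_args: &[ColumnarValue], metadata: RecordBatchMetadata| {
            Ok(ColumnarValue::Scalar(ScalarValue::Utf8(metadata.filename)))
        })
    }

Returning ColumnarValue::Scalar means the engine treats the file name as a constant for the whole batch, which matches the granularity the metadata actually has (one source file per batch).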