Hi,

One of Apache Spark's most useful SQL functions is 'input_file_name', which
provides a simple API for identifying the source of a row of data when it is
read from a file-based source like Parquet or CSV. This is particularly
useful for identifying which chunk/partition of a Parquet dataset a row came
from, and it is used heavily by the Delta Lake format to determine which
files are impacted by MERGE operations.
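
To illustrate, here is roughly what usage could look like from the
DataFusion API once such a function is registered. This is only a sketch:
'input_file_name' does not exist in DataFusion yet, the table name and path
are made up, and the SessionContext/ParquetReadOptions names are taken from
a recent DataFusion release:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "data/", ParquetReadOptions::default())
        .await?;
    // Each row would report the Parquet file it was read from.
    ctx.sql("SELECT input_file_name(), * FROM t")
        .await?
        .show()
        .await?;
    Ok(())
}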

I have built a functional proof-of-concept for DataFusion, but it requires
modifying the RecordBatch struct to include a 'metadata' struct
(RecordBatchMetadata) that carries the source file name along with each batch.
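
A minimal sketch of what RecordBatchMetadata could look like (the field name
is illustrative; the actual branch may differ):

#[derive(Debug, Clone, Default)]
pub struct RecordBatchMetadata {
    /// Path of the file this batch was read from, if known.
    pub source_file: Option<String>,
}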

It also requires changing the ScalarFunctionImplementation signature (and
therefore every scalar function) to expose the metadata:

From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;
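
Under the new signature, a minimal input_file_name implementation could look
something like this. Again a sketch only, using the RecordBatchMetadata
outlined above rather than the exact branch code, and import paths vary
between DataFusion versions:

use datafusion::error::Result;
use datafusion::physical_plan::ColumnarValue;
use datafusion::scalar::ScalarValue;

fn input_file_name(
    _args: &[ColumnarValue],
    metadata: RecordBatchMetadata,
) -> Result<ColumnarValue> {
    // Broadcast the batch's source file name as a scalar string;
    // NULL if the batch did not come from a file.
    Ok(ColumnarValue::Scalar(ScalarValue::Utf8(metadata.source_file)))
}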

These changes have been made in a personal feature branch and are available
for review (it still needs cleaning up). Conceptually, does anyone have a
problem with this API change, or does anyone have a better proposal?

Thanks
Mike
