Re: [Rust][DataFusion] Supporting input_file_name()

Micah Kornfield Wed, 24 Feb 2021 13:58:15 -0800

At least in C++ (and in the IPC format), a schema can be shared across many
RecordBatches, which might have different sources.

It might be useful to define a reserved metadata key (similar to
extension types) so that the data can be interpreted consistently.
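
As a rough sketch of what such a reserved key could look like, the Rust
snippet below stores the source file name in a string-to-string map of the
kind that `Schema::new_with_metadata` accepts. The key name
`ARROW:source:file_name` (modeled on the `ARROW:extension:name` key used by
extension types) and both helper functions are made up for illustration, and
the plain `HashMap` stands in for the schema metadata so the snippet compiles
without the arrow crate.

```rust
use std::collections::HashMap;

/// Hypothetical reserved key; Arrow does not define it today.
pub const SOURCE_FILE_KEY: &str = "ARROW:source:file_name";

/// Stand-in for the key/value metadata map carried by an Arrow schema.
pub type SchemaMetadata = HashMap<String, String>;

/// Returns the metadata with the source file name stored under the reserved key.
pub fn with_source_file(mut metadata: SchemaMetadata, file_name: &str) -> SchemaMetadata {
    metadata.insert(SOURCE_FILE_KEY.to_string(), file_name.to_string());
    metadata
}

/// Looks up the source file name, if the producer recorded one.
pub fn source_file(metadata: &SchemaMetadata) -> Option<&str> {
    metadata.get(SOURCE_FILE_KEY).map(String::as_str)
}
```

Because a schema may be shared by batches from different files, a reader
would presumably have to build one metadata map (and so one schema) per
source file for this to give the right answer.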

On Wed, Feb 24, 2021 at 11:29 AM Andrew Lamb <al...@influxdata.com> wrote:

> I wonder if you could add the file_name as metadata on the `Schema` of the
> RecordBatch rather than the RecordBatch itself? Since every RecordBatch has
> a schema, I don't fully understand the need to add something additional to
> the RecordBatch.
>
>
> https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
>
> On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <seddo...@gmail.com> wrote:
>
> > Hi,
> >
> > One of Apache Spark's very useful SQL functions is 'input_file_name',
> > which provides a simple API for identifying the source of a row of data
> > when it is read from a file-based source like Parquet or CSV. This is
> > particularly useful for identifying which chunk/partition of a Parquet
> > dataset the row came from, and it is used heavily by the DeltaLake format
> > to determine which files are impacted for MERGE operations.
> >
> > I have built a functional proof-of-concept for DataFusion but it requires
> > modifying the RecordBatch struct to include a 'metadata' struct
> > (RecordBatchMetadata) to carry the source file name attached to each
> > batch.
> >
> > It also requires changing the ScalarFunctionImplementation signature to
> > support exposing the metadata (and therefore all the functions).
> >
> > From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
> > To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) ->
> >       Result<ColumnarValue> + Send + Sync>;
> >
> > These changes have been made in a personal feature branch and are
> > available for review (it still needs cleaning), but conceptually does
> > anyone have a problem with this API change or have a better proposal?
> >
> > Thanks
> > Mike
> >
>

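For reference, the signature change proposed in the quoted message could look
roughly like the following. This is only a sketch: `ColumnarValue` is reduced
to a string-only stand-in, the `source_file` field on `RecordBatchMetadata`
is a guess at what the proof-of-concept carries, and the `input_file_name`
helper is hypothetical. None of it is taken from the actual branch.

```rust
use std::sync::Arc;

/// Stand-in for DataFusion's ColumnarValue, reduced to string data.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnarValue {
    Scalar(Option<String>),
    Array(Vec<Option<String>>),
}

/// Hypothetical shape of the proposed per-batch metadata.
#[derive(Debug, Clone, Default)]
pub struct RecordBatchMetadata {
    pub source_file: Option<String>,
}

/// Simplified result type; the real one would use DataFusion's error type.
pub type Result<T> = std::result::Result<T, String>;

/// Proposed signature: every scalar function also receives the batch metadata.
pub type ScalarFunctionImplementation =
    Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;

/// Builds a function that ignores its arguments and returns the batch's
/// source file name as a scalar (None if the batch has no known source).
pub fn input_file_name() -> ScalarFunctionImplementation {
    Arc::new(
        |_args: &[ColumnarValue], metadata: RecordBatchMetadata| -> Result<ColumnarValue> {
            Ok(ColumnarValue::Scalar(metadata.source_file))
        },
    )
}
```

The sketch also shows the cost raised in the quoted message: every existing
function would have to accept the extra `RecordBatchMetadata` parameter even
though only `input_file_name` reads it.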