[DISCUSS] Acero's ScanNode and Row Indexing across Scans

Rusty Conover Mon, 29 May 2023 14:12:41 -0700

Hi Arrow Team,

I wanted to suggest an improvement regarding Acero's Scan node.
Currently, it provides useful information such as __fragment_index,
__batch_index, __filename, and __last_in_fragment. However, it would
be beneficial to have an additional column that returns an overall
"row index" from the source.


The row index would start at zero and increase by one for each row
read from the source. My main interest is in Parquet files. Is it
currently possible to obtain such a row index, or would the Scan
node's behavior need to be extended?
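
To make the semantics I have in mind concrete, here is a minimal sketch
in plain Python. It does not use Acero or pyarrow: lists of dicts stand
in for record batches, the name with_row_index is hypothetical, and it
assumes batches arrive in source order.

```python
from typing import Iterable, Iterator, List, Tuple


def with_row_index(
    batches: Iterable[List[dict]],
) -> Iterator[List[Tuple[int, dict]]]:
    """Pair each row with a running index that continues across batches."""
    next_index = 0  # zero-based position of the next row in the source
    for batch in batches:
        yield [(next_index + i, row) for i, row in enumerate(batch)]
        next_index += len(batch)


# Two stand-in batches from one source; the index does not reset
# at the batch boundary.
batches = [[{"id": "a"}, {"id": "b"}], [{"id": "c"}]]
indexed = list(with_row_index(batches))
```

The point is only that the index keeps counting across batch
boundaries, unlike __batch_index, which identifies a batch rather than
a row.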

Having this row index column would be valuable in implementing support
for Iceberg's positional-based delete files, as outlined in the
following link:

https://iceberg.apache.org/spec/#delete-formats

Iceberg's value-based deletes can already be performed using Acero's
support for anti joins. Positional deletes are harder: deriving row
positions in a projection node is not reliable, because row ordering
is not guaranteed within an Acero graph. A dedicated row index column
produced by the Scan node would therefore be a more reliable solution
in this context.
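
As a rough illustration of how the row index would be used, the sketch
below applies positional deletes in plain Python. It relies on the
Iceberg spec's position delete layout (a file_path and a pos per
deleted row); the function name, the file paths, and the sample rows
are hypothetical, and none of this is Acero API.

```python
from typing import Iterable, List, Tuple


def apply_position_deletes(
    file_path: str,
    rows: Iterable[dict],
    deletes: Iterable[Tuple[str, int]],
) -> List[dict]:
    """Drop rows whose zero-based position in file_path is listed in deletes."""
    # Keep only the delete entries that target this data file.
    deleted = {pos for path, pos in deletes if path == file_path}
    # enumerate() plays the role of the proposed row index column; it is
    # only correct here because `rows` is in source order.
    return [row for pos, row in enumerate(rows) if pos not in deleted]


rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
deletes = [
    ("data/f1.parquet", 1),
    ("data/f2.parquet", 0),  # targets a different file, so it is ignored
    ("data/f1.parquet", 3),
]
kept = apply_position_deletes("data/f1.parquet", rows, deletes)
```

With a row index column coming out of the scan, the same filter could
be expressed as an anti join on (filename, row index) without
depending on the order in which batches flow through the graph.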

Thank you for considering this suggestion.

Rusty
