Hi Rusty, At first glance, I think adding a row_index column would make sense. To be clear, this would be an index within a file / fragment, not across multiple files, which don't necessarily have a known ordering in Acero (IIUC).
However, another approach would be to take a mask argument in the Parquet reader. We may wish to do this anyways for support for using predicate pushdown with Parquet's page index. While Arrow C++ hasn't yet implemented predicate pushdown on page index (right now just supports row groups), Arrow Rust has and provides an API to pass in a mask to support it. The reason for this implementation is described in the blog post "Querying Parquet with Millisecond Latency" [1], under "Page Pruning". The RowSelection struct API is worth a look [2]. I'm not yet sure which would be preferable, but I think adopting a similar pattern to what the Rust community has done may be wise. It's possible that row_index is easy to implement while the mask will take time, in which case row_index makes sense as an interim solution. Best, Will Jones [1] https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ [2] https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html On Mon, May 29, 2023 at 2:12 PM Rusty Conover <ru...@conover.me.invalid> wrote: > Hi Arrow Team, > > I wanted to suggest an improvement regarding Acero's Scan node. > Currently, it provides useful information such as __fragment_index, > __batch_index, __filename, and __last_in_fragment. However, it would > be beneficial to have an additional column that returns an overall > "row index" from the source. > > The row index would start from zero and increment for each row > retrieved from the source, particularly in the case of Parquet files. > Is it currently possible to obtain this row index or would expanding > the Scan node's behavior be required? > > Having this row index column would be valuable in implementing support > for Iceberg's positional-based delete files, as outlined in the > following link: > > https://iceberg.apache.org/spec/#delete-formats > > While Iceberg's value-based deletes can already be performed using the > support for anti joins, using a projection node does not guarantee the > row ordering within an Acero graph. Hence, the inclusion of a > dedicated row index column would provide a more reliable solution in > this context. > > Thank you for considering this suggestion. > > Rusty >