findepi commented on issue #13261: URL: https://github.com/apache/datafusion/issues/13261#issuecomment-2458092019
> use a function like `ROW_NUMBER` to figure out the positions of rows. It would be great if the parquet reader machinery could expose this information directly instead.

The SQL-level approach would work only if the source file isn't filtered: no predicates, no pre-existing deletion vectors, etc.
I agree with the assessment that the information must come from the file reader itself.

> ### Describe the solution you'd like
> I'm not sure what a good API would look like here, but one idea is that the parquet reader could expose some new option that enables row position information to be returned as some special column name. I.E.
>
> ```rust
> let ctx = SessionContext::new_with_config(SessionConfig::default().set_bool("datafusion.execution.parquet.include_row_position", true))
> let record_batches = ctx.read_parquet("foo.parquet").filter(filters).select(PARQUET_ROW_POSITION).collect();
> // record batches now contains the indexes of rows in "foo.parquet" that match the provided filters.
> ```

I like the syntax.

@alamb can this be handled with some form of a hidden column?
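
To illustrate why numbering rows at the SQL level stops matching file positions once anything is filtered out, here is a minimal sketch in plain Rust. It does not use any DataFusion API; the data, the `keep` predicate, and both function names are made up for the example, and positions are 0-based here (SQL's `ROW_NUMBER` is 1-based, which does not change the point).

```rust
/// Made-up predicate standing in for the query's filter (or a deletion vector).
fn keep(value: i32) -> bool {
    value > 8
}

/// Numbers rows after filtering, which is what a SQL-level row number sees.
fn numbered_after_filter(rows: &[i32]) -> Vec<(usize, i32)> {
    rows.iter()
        .copied()
        .filter(|v| keep(*v))
        .enumerate()
        .collect()
}

/// Attaches the position first, as a file reader could, and filters afterwards.
fn positions_from_reader(rows: &[i32]) -> Vec<(usize, i32)> {
    rows.iter()
        .copied()
        .enumerate()
        .filter(|(_, v)| keep(*v))
        .collect()
}

fn main() {
    // The slice index stands in for the row's position in the file.
    let file_rows = [10, 25, 7, 42, 3];
    println!("{:?}", numbered_after_filter(&file_rows)); // [(0, 10), (1, 25), (2, 42)]
    println!("{:?}", positions_from_reader(&file_rows)); // [(0, 10), (1, 25), (3, 42)]
}
```

The value `42` sits at position 3 in the file, but numbering after the filter reports 2, so only the variant that tags positions before filtering yields usable file offsets.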