davlee1972 commented on issue #38518: URL: https://github.com/apache/arrow/issues/38518#issuecomment-3452824541
**Bumping this issue. The documentation says expressions should work now, but they don't.**

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head

**columns** : [list](https://docs.python.org/3/library/stdtypes.html#list) of [str](https://docs.python.org/3/library/stdtypes.html#str), default [None](https://docs.python.org/3/library/constants.html#None)

The columns to project. This can be **a list of column names** to include (order and duplicates will be preserved), or **a dictionary with {new_column_name: expression} values** for more advanced projections.

The list of columns or **expressions** may use the **special fields** `__batch_index` (the index of the batch within the fragment), `__fragment_index` (the index of the fragment within the dataset), `__last_in_fragment` (whether the batch is last in fragment), and **`__filename`** (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

**Sample code pulling data using a list of columns vs a dictionary of expressions.**

<img width="1352" height="816" alt="Image" src="https://github.com/user-attachments/assets/f709c65f-121e-4d06-a665-7bd7a7573096" />
