Carson Eisenach created ARROW-9325:
--------------------------------------

             Summary: [Python] ParquetDataset typecast on read
                 Key: ARROW-9325
                 URL: https://issues.apache.org/jira/browse/ARROW-9325
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.17.1
            Reporter: Carson Eisenach
When reading large Parquet tables, it would be useful to have the option to cast columns to a different type. Consider a large table with 64-bit columns (float64 and int64); the user might prefer to read these in as 32-bit types if full precision is not required.

*Current behavior:* One must first read the table and then cast it.

*Desired behavior:* Provide an additional kwarg that lets the user specify a target schema. This would be propagated through to ParquetFileFragment, so each fragment can be cast as soon as it is read.

*Impact:* In cases where the user wants to cast all columns to single precision and the dataset has many partitions, this feature would reduce the maximum memory required by roughly 50%.

--------------

I've already implemented a POC using the old Dataset API; I can reimplement it using the v2 dataset API and then submit a patch. A couple of questions:
 1. Does this feature fit in with the Arrow roadmap?
 2. Alternatively, is there a way to accomplish this already in v0.17 that I am missing?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)