[ https://issues.apache.org/jira/browse/ARROW-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-9325:
--------------------------------
    Summary: [C++][Dataset][Python] ParquetDataset typecast on read  (was: [Python] ParquetDataset typecast on read)

> [C++][Dataset][Python] ParquetDataset typecast on read
> ------------------------------------------------------
>
>                 Key: ARROW-9325
>                 URL: https://issues.apache.org/jira/browse/ARROW-9325
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 0.17.1
>            Reporter: Carson Eisenach
>            Priority: Major
>
> When reading large Parquet tables, it would be useful to have the option to
> cast columns to a different type. Consider a large table with 64-bit types
> (float64 and int64): the user might prefer to read these in as 32-bit types
> if the extra precision is not required.
>
> *Current behavior:* One must first read the whole table and then cast it.
>
> *Desired behavior:* Provide an additional kwarg that allows the user to
> specify a target schema. This would be propagated through to
> ParquetFileFragment, so that each fragment can be cast as soon as it is read.
>
> *Impact:* In cases where the user wants to cast all columns to 32-bit types
> and the dataset has many partitions, this feature would reduce the peak
> memory required by roughly 50%.
>
> --------------
>
> I've already implemented a POC using the old Dataset API, and can reimplement
> it using the v2 dataset API and then submit a patch.
>
> A couple of questions:
> 1. Does this feature fit in with the Arrow roadmap?
> 2. Alternatively, is there a way to accomplish this already in v0.17 that I
> am missing?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)