Carson Eisenach created ARROW-9325:
--------------------------------------
Summary: [Python] ParquetDataset typecast on read
Key: ARROW-9325
URL: https://issues.apache.org/jira/browse/ARROW-9325
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Affects Versions: 0.17.1
Reporter: Carson Eisenach
When reading large Parquet tables, it would be useful to have the option to
cast columns to a different type. Consider a large table with 64-bit types
(float64 and int64). The user might prefer to read these in as 32-bit types
(float32 and int32) if the extra width is not required.
*Current behavior:* One must first read the entire table and then cast it, so
the full 64-bit table is held in memory before the cast.
*Desired behavior:* Provide an additional keyword argument that lets the user
specify a target schema. This would be propagated through to
ParquetFileFragment, so that each fragment is cast as soon as it is read.
*Impact:* When the user wants to cast all columns to single precision and the
dataset has many partitions, this feature would reduce the peak memory
required by roughly 50%.
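The 50% figure can be checked with a back-of-the-envelope model. The sketch below is only an estimate under simplifying assumptions that are not part of this issue: every value takes exactly 8 bytes before the cast and 4 bytes after, fragments are equally sized, buffer overhead is ignored, and the baseline counts only the full 64-bit table. Both function names are hypothetical.

```python
def baseline_peak_bytes(n_values):
    """Peak bytes when the whole table is resident at 8 bytes per value."""
    return 8 * n_values


def per_fragment_peak_bytes(n_values, n_fragments):
    """Rough upper bound when each fragment is cast right after it is read.

    The accumulated result is counted at 4 bytes per value, plus one
    fragment still held at 8 bytes per value while it is being cast.
    """
    values_per_fragment = n_values // n_fragments
    return 4 * n_values + 8 * values_per_fragment


n_values = 1_000_000
n_fragments = 100

baseline = baseline_peak_bytes(n_values)                    # 8_000_000
streamed = per_fragment_peak_bytes(n_values, n_fragments)   # 4_080_000
ratio = streamed / baseline                                 # 0.51
```

Under this model the ratio is 0.5 + 1/n_fragments, which approaches one half as the number of partitions grows.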
--------------
I've already implemented a POC using the old Dataset API. I can reimplement it
using the v2 dataset API and then submit a patch.
A couple of questions:
1. Does this feature fit in with the Arrow roadmap?
2. Alternatively, is there a way to accomplish this already in v0.17 that I am
missing?