[
https://issues.apache.org/jira/browse/ARROW-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-9325:
-----------------------------------------
Component/s: C++
> [C++][Dataset][Python] ParquetDataset typecast on read
> ------------------------------------------------------
>
> Key: ARROW-9325
> URL: https://issues.apache.org/jira/browse/ARROW-9325
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.17.1
> Reporter: Carson Eisenach
> Priority: Major
> Labels: dataset
>
> When reading large Parquet tables, it would be useful to have the option to
> cast columns to a different type. Consider a large table with 64-bit
> columns (float64 and int64): the user might prefer to read these in as
> their 32-bit (single-precision) counterparts if the extra precision is not
> required.
> *Current behavior:* One must first read the table and then cast it.
> *Desired behavior:* Provide an additional kwarg that allows the user to
> specify a target schema. This would be propagated through to
> ParquetFileFragment, so that each fragment can be cast as soon as it is read.
> *Impact:* In cases where the user wants to cast all columns to single
> precision and the dataset has many partitions, this feature would reduce the
> peak memory required by roughly 50%.
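The "roughly 50%" figure can be checked with a back-of-the-envelope model. The sketch below is an illustration, not a measurement. It assumes every column narrows from 8 to 4 bytes per value, that reading and then casting needs at least the full 64-bit table in memory, and that casting on read holds the already-narrowed fragments plus one fragment in both widths. The function names and the layout of 100 fragments of 80 MB are hypothetical.

```python
def peak_read_then_cast(fragment_bytes):
    """Lower bound on peak bytes when the whole table is read before casting."""
    # The full-width table must exist in memory before any cast can start.
    return sum(fragment_bytes)


def peak_cast_on_read(fragment_bytes, ratio=0.5):
    """Peak bytes when each fragment is cast as soon as it is read.

    `ratio` is the narrowed size divided by the original size
    (0.5 for 64-bit -> 32-bit columns).
    """
    peak = 0.0
    held = 0.0  # bytes of fragments that have already been narrowed
    for size in fragment_bytes:
        narrowed = size * ratio
        # During the cast, the fragment exists in both widths.
        peak = max(peak, held + size + narrowed)
        held += narrowed
    return peak


# Hypothetical layout: 100 fragments of 80 MB each at 64-bit width.
fragments = [80_000_000] * 100
before = peak_read_then_cast(fragments)
after = peak_cast_on_read(fragments)
saving = 1 - after / before
print(f"peak before: {before:,} bytes, after: {after:,.0f} bytes, saving: {saving:.0%}")
```

Under these assumptions the saving approaches 50% as the number of fragments grows, because the single full-width fragment in flight becomes a negligible share of the total.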
> --------------
> I've already implemented a proof of concept using the old Dataset API. I can
> reimplement it using the v2 dataset API and then submit a patch.
> A couple of questions:
> 1. Does this feature fit in with the Arrow roadmap?
> 2. Alternatively, is there a way to accomplish this already in v0.17 that I
> am missing?