[jira] [Created] (ARROW-9325) [Python] ParquetDataset typecast on read

Carson Eisenach (Jira) Sat, 04 Jul 2020 12:39:11 -0700

Carson Eisenach created ARROW-9325:
--------------------------------------

             Summary: [Python] ParquetDataset typecast on read
                 Key: ARROW-9325
                 URL: https://issues.apache.org/jira/browse/ARROW-9325
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.17.1
            Reporter: Carson Eisenach



When reading large Parquet tables, it would be useful to have the option to 
cast columns to a different type. Consider a large table with 64-bit types 
(float64 and int64): the user might prefer to read these in as 32-bit types 
(float32 and int32) if the extra precision or range is not required. 

*Current behavior:* One must first read the whole table and then cast it, so 
the full-width table is held in memory before the cast.

*Desired behavior:* provide an additional kwarg that allows the user to specify 
a target schema. This would be propagated through to ParquetFileFragment, and 
each fragment can be cast as soon as it is read.

*Impact:* In cases where the user wants to cast all columns to single precision 
and the dataset has many partitions, this feature would reduce the peak memory 
required by roughly 50%.

--------------

I've already implemented a proof of concept using the old Dataset API; I can 
reimplement it using the v2 dataset API and then submit a patch.

A couple of questions:

1. Does this feature fit in with the Arrow roadmap?

2. Alternatively, is there a way to accomplish this already in v0.17 that I am 
missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
