[ https://issues.apache.org/jira/browse/ARROW-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169172#comment-17169172 ]

Wes McKinney commented on ARROW-9325:
-------------------------------------

Providing a schema to coerce/cast to sounds reasonable to me, and within scope 
for the C++ datasets framework.
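
For illustration, a rough sketch of how this could look from the Python side, 
assuming a hypothetical target-schema option on the dataset factory that casts 
each fragment as it is read (the exact name and placement of the kwarg are not 
settled here; the column names are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Target schema: narrower types than what is stored on disk (float64/int64).
target = pa.schema([
    ("x", pa.float32()),
    ("y", pa.int32()),
])

# Hypothetical: the dataset would cast each fragment to `target` as it is
# read, instead of materializing the full-width columns first.
dataset = ds.dataset("path/to/parquet_dir", format="parquet", schema=target)
table = dataset.to_table()  # columns arrive as float32/int32
{code}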

> [C++][Dataset][Python] ParquetDataset typecast on read
> ------------------------------------------------------
>
>                 Key: ARROW-9325
>                 URL: https://issues.apache.org/jira/browse/ARROW-9325
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 0.17.1
>            Reporter: Carson Eisenach
>            Priority: Major
>
> When reading large Parquet tables, it would be useful to have the option to 
> cast columns to a different type on read. Consider a large table with 64-bit 
> column types (float64 and int64): the user might prefer to read these in as 
> 32-bit types (float32 and int32) if the extra precision and range are not 
> required.
> *Current behavior:* One must first read the full table and then cast it (see 
> the read-then-cast sketch after this description).
> *Desired behavior:* Provide an additional kwarg that lets the user specify a 
> target schema. This would be propagated through to ParquetFileFragment, and 
> each fragment would be cast as soon as it is read.
> *Impact:* In cases where the user wants to cast all columns to single 
> precision and the dataset has many partitions, this feature would reduce the 
> maximum memory required by roughly 50%.
> --------------
> I've already implemented a POC using the old Dataset API; I can reimplement 
> it using the v2 Dataset API and then submit a patch.
> A couple of questions:
> 1. Does this feature fit in with the Arrow roadmap?
> 2. Alternatively, is there a way to accomplish this already in v0.17 that I 
> am missing?
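
For reference, the read-then-cast workaround mentioned under *Current 
behavior*, which works today but holds the full-width table in memory before 
the cast (column names are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Read the whole table at its stored (64-bit) types first.
table = pq.read_table("path/to/parquet_dir")

# Then cast down; peak memory briefly holds both the 64-bit and 32-bit copies.
target = pa.schema([
    ("x", pa.float32()),
    ("y", pa.int32()),
])
table = table.cast(target)
{code}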



--
This message was sent by Atlassian Jira
(v8.3.4#803005)