[jira] [Commented] (ARROW-18001) [Python] parquet.write_table/parquet.ParquetWriter should accept a subset of columns

Joris Van den Bossche (Jira) Wed, 12 Oct 2022 06:14:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616413#comment-17616413
 ]


Joris Van den Bossche commented on ARROW-18001:
-----------------------------------------------

Some background: in this specific case, when using the _pandas_ (or dask) 
{{to_parquet}} method, the {{schema}} keyword is passed to 
{{Table.from_pandas}} and not to the actual Parquet write methods. 
In general, type inference happens when converting your Python object (e.g. a 
pandas DataFrame or a dict) to an Arrow Table. Once you have a table with a 
fixed schema, writing to Parquet no longer does any type inference, since 
Arrow types map to Parquet types. 

So I think we should reframe the issue as providing a way to specify the type 
of a subset of columns for {{from_pandas}}.

Doing a small search for other JIRAs, I noticed that at some point in the past 
we actually did support a partial schema. This was accidentally broken and 
then fixed again in ARROW-1125, although it was already noted in that PR that 
we might prefer doing this in another way: 
https://github.com/apache/arrow/pull/790#discussion_r124543809
Afterwards, the behaviour was changed again (intentionally) in ARROW-3766, 
which now honors the exact schema as passed: 
https://github.com/apache/arrow/pull/2979#discussion_r234010810

> [Python] parquet.write_table/parquet.ParquetWriter should accept a subset of 
> columns
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18001
>                 URL: https://issues.apache.org/jira/browse/ARROW-18001
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alenka Frim
>            Priority: Major
>
> This question came up in the GitHub issue 
> [https://github.com/apache/arrow/issues/14025] and it would be a good 
> improvement to the Parquet part of PyArrow. I haven't found an existing 
> issue, so I created a new one.
> h6. Description:
> If a user wants to change the type of a single column when using 
> {{parquet.write_table}}/{{parquet.ParquetWriter}}, they currently need 
> to specify a schema that includes all columns. If a column is not specified 
> in the schema, it will not be included in the Parquet file.
> h6. Proposal
> {{parquet.ParquetWriter}} should be able to accept a schema that covers 
> only a subset of columns and infer the types of everything else.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
