[jira] [Updated] (ARROW-16240) [Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False

Joris Van den Bossche (Jira) Thu, 21 Apr 2022 11:36:06 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-16240:
------------------------------------------
    Description: 
The {{pq.write_to_dataset}} (legacy implementation) supports the 
{{row_group_size}}/{{chunk_size}} keyword to specify the row group size of the 
written parquet files.

Now that we made {{use_legacy_dataset=False}} the default, this keyword doesn't 
work anymore.

This is because {{dataset.write_dataset(..)}} doesn't support the parquet 
{{row_group_size}} keyword: the {{ParquetFileWriteOptions}} class has no 
such option. 

On the parquet side, this is also the only keyword that is not passed to the 
{{ParquetWriter}} init (and thus to parquet's {{WriterProperties}} or 
{{ArrowWriterProperties}}), but to the actual {{write_table}} call. In C++ this 
can be seen at 
https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71


See discussion: 
[https://github.com/apache/arrow/pull/12811#discussion_r845304218]

  was:
The {{_create_dataset_for_fragments()}} helper function in test_dataset.py 
needs to be updated to reflect the changes to {{write_to_dataset}} in 
ARROW-16122: the default for the {{use_legacy_dataset}} keyword will be set 
to False, but {{dataset.write_dataset(..)}} doesn't support the parquet 
{{row_group_size}} keyword.

See discussion: 
[https://github.com/apache/arrow/pull/12811#discussion_r845304218]


> [Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset 
> with use_legacy_dataset=False
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16240
>                 URL: https://issues.apache.org/jira/browse/ARROW-16240
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alenka Frim
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
