[jira] [Comment Edited] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

Joris Van den Bossche (Jira) Fri, 01 Oct 2021 07:32:14 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423310#comment-17423310
 ]


Joris Van den Bossche edited comment on ARROW-14196 at 10/1/21, 2:31 PM:
-------------------------------------------------------------------------

I wrote this with the latest pyarrow (master):

{code:python}
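import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
# Opt in to the list naming from the Parquet specification
pq.write_table(table, "test_nested_compliant.parquet",
               use_compliant_nested_type=True)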
{code}

In the latest pyarrow both files read fine, but the list child fields have 
different names (which can affect e.g. nested field refs):

{code:python}
>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64
{code}

So e.g. {{pq.read_table("test_nested_noncompliant.parquet", 
columns=["a.list.item"], use_legacy_dataset=True)}} works for the noncompliant 
file, but the same {{columns}} value doesn't select anything with the 
compliant file.
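
A minimal sketch of how the leaf column path differs between the two files, assuming the {{<column>.list.<child>}} layout used in the selection above. The helper name {{list_leaf_path}} is hypothetical and not part of pyarrow. The {{a.list.element}} path for the compliant file is inferred from the {{element}} child name shown in the output and was not tested in this comment.

```python
def list_leaf_path(column: str, compliant: bool) -> str:
    """Return the dotted Parquet path of a list column's leaf values."""
    # Hypothetical helper, not a pyarrow API. The child is named "element"
    # in the compliant layout and "item" in the non-compliant one.
    child = "element" if compliant else "item"
    return f"{column}.list.{child}"


print(list_leaf_path("a", compliant=False))  # a.list.item
print(list_leaf_path("a", compliant=True))   # a.list.element
```

A path built this way could be passed in {{columns=[...]}}, so code that hardcodes {{a.list.item}} would need to account for which writer setting produced the file.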

Those files also read fine (and result in the same difference in list field 
names) with older versions of Arrow (tested down to Arrow 1.0). 


> [C++][Parquet] Default to compliant nested types in Parquet writer
> ------------------------------------------------------------------
>
>                 Key: ARROW-14196
>                 URL: https://issues.apache.org/jira/browse/ARROW-14196
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
