[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

Joris Van den Bossche (Jira) Fri, 01 Oct 2021 07:06:26 -0700


    [ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423302#comment-17423302 ]


Joris Van den Bossche commented on ARROW-14196:
-----------------------------------------------

Some relevant quotes from ARROW-11497:

> [Micah] I think the main reason it isn't enabled by default is it breaks 
> round trips for arrow data.  This could potentially be fixed on the reader 
> side as well.  

>> [Antoine] Perhaps we could convert the field name at the Arrow<->Parquet 
>> boundary.
> [Micah] This should be possible but it potentially needs another flag.  I 
> think in the short term plumbing the additional flag through to python makes 
> sense and we can figure out a longer term solution if this becomes a larger 
> problem.

>> [Antoine] It should simply be the default (and obviously right) behaviour. 
>> Am I missing something?
> [Micah] Backwards compatibility?  It might be possible to make some 
> inferences (haven't thought about it deeply).  But I think if we were reading 
> a conforming java produced parquet file then we would get different column 
> names if we transformed on the border (maybe there can be some rules around 
> Arrow metadata being present).  I think we can make the default to be 
> conforming behavior, but we should give users some level of control to 
> preserve the old behavior.

---

I am not very familiar with this code, but the simplest option seems to be to 
just switch the default of the {{compliant_nested_types}} option in 
ArrowWriterProperties. What would be the (possibly backwards incompatible) 
consequences of that?  
We would start writing a different Parquet file (but one that actually follows 
the spec). I suppose that when reading such a file, you would then also get a 
different name for the sub-lists (which can impact selecting a sublist with a 
nested field reference?).  
To avoid a breaking change on the read path, we could by default also 
convert the names at the Parquet->Arrow boundary (as the 
{{compliant_nested_types}} option already does at the Arrow->Parquet 
boundary). However, doing that can _also_ break code for people who are 
already reading compliant Parquet files.
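
To make the trade-off concrete, here is a minimal pure-Python sketch of the 
renaming at both boundaries. It does not call Arrow at all: the function names 
and the {{convert_on_read}} flag are hypothetical and only model the naming 
behaviour discussed above. It assumes Arrow's default list child name "item" 
and the "element" name that the Parquet spec prescribes.

```python
# Toy model of the list child-field naming; none of these names are Arrow API.
ARROW_ITEM = "item"          # Arrow's default name for a list's child field
PARQUET_ELEMENT = "element"  # name the Parquet spec prescribes for list elements


def write_child_name(arrow_child_name, compliant):
    """Arrow -> Parquet: name stored in the file for the list element."""
    return PARQUET_ELEMENT if compliant else arrow_child_name


def read_child_name(parquet_child_name, convert_on_read):
    """Parquet -> Arrow: name of the list child field after reading.

    convert_on_read is a hypothetical flag for converting on the read side.
    """
    if convert_on_read and parquet_child_name == PARQUET_ELEMENT:
        return ARROW_ITEM
    return parquet_child_name


def round_trip(arrow_child_name, compliant, convert_on_read):
    """Write with Arrow, then read the same file back with Arrow."""
    stored = write_child_name(arrow_child_name, compliant)
    return read_child_name(stored, convert_on_read)


# Non-compliant writer, no conversion: the name round-trips unchanged.
print(round_trip(ARROW_ITEM, compliant=False, convert_on_read=False))
# Compliant writer only: a reference to "item" no longer matches after reading.
print(round_trip(ARROW_ITEM, compliant=True, convert_on_read=False))
# Compliant writer plus conversion on read: the round trip is restored.
print(round_trip(ARROW_ITEM, compliant=True, convert_on_read=True))
# A file from another compliant writer now reads as "item", not "element".
print(read_child_name(PARQUET_ELEMENT, convert_on_read=True))
```

The last case is the one that can break existing readers: code that selects 
the nested field as "element" today would stop matching once names are 
converted at the Parquet->Arrow boundary.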

> [C++][Parquet] Default to compliant nested types in Parquet writer
> ------------------------------------------------------------------
>
>                 Key: ARROW-14196
>                 URL: https://issues.apache.org/jira/browse/ARROW-14196
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
