[
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423302#comment-17423302
]
Joris Van den Bossche commented on ARROW-14196:
-----------------------------------------------
Some relevant quotes from ARROW-11497:
> [Micah] I think the main reason it isn't enabled by default is it breaks
> round trips for arrow data. This could potentially be fixed on the reader
> side as well.
>> [Antoine] Perhaps we could convert the field name at the Arrow<->Parquet
>> boundary.
> [Micah] This should be possible but it potentially needs another flag. I
> think in the short term plumbing the additional flag through to python makes
> sense and we can figure out a longer term solution if this becomes a larger
> problem.
>> [Antoine] It should simply be the default (and obviously right) behaviour.
>> Am I missing something?
> [Micah] Backwards compatibility? It might be possible to make some
> inferences (haven't thought about it deeply). But I think if we were reading
> a conforming java produced parquet file then we would get different column
> names if we transformed on the border (maybe there can be some rules around
> Arrow metadata being present). I think we can make the default to be
> conforming behavior, but we should give users some level of control to
> preserve the old behavior.
---
I am not very familiar with this, but the simplest option seems to be to just
switch the default of the {{compliant_nested_types}} option in
ArrowWriterProperties. What would be the (possibly backwards incompatible)
consequences of that?
We would start writing a different Parquet file (one that actually follows the
spec). But I suppose that when reading such a file back, you would then also
get a different name for the sub-lists (which can affect selecting a sub-list
with a nested field reference?).
To avoid a breaking change on the read path, we could by default also convert
the names at the Parquet->Arrow boundary (like the
{{compliant_nested_types}} option already does on the Arrow->Parquet
boundary). However, doing that can _also_ break code for people who are
already reading compliant Parquet files.
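
The trade-off described above can be sketched with a small toy model. This is
only an illustration: the helper functions and the {{convert_on_read}} flag are
hypothetical and are not part of the Arrow API. It assumes the Arrow default
child name "item" and the Parquet spec name "element" for list children.

```python
# Toy model only: these helpers are hypothetical, not part of the Arrow API.
ARROW_DEFAULT_CHILD = "item"     # default child field name of an Arrow list type
PARQUET_SPEC_CHILD = "element"   # child field name required by the Parquet spec


def write_child_name(arrow_name, compliant):
    """Name stored in the Parquet file for a list's child field."""
    return PARQUET_SPEC_CHILD if compliant else arrow_name


def read_child_name(parquet_name, convert_on_read):
    """Name exposed in the Arrow schema when the file is read back."""
    if convert_on_read and parquet_name == PARQUET_SPEC_CHILD:
        return ARROW_DEFAULT_CHILD
    return parquet_name


def round_trip(arrow_name, compliant, convert_on_read):
    """Write a list child name to Parquet and read it back into Arrow."""
    stored = write_child_name(arrow_name, compliant)
    return read_child_name(stored, convert_on_read)


scenarios = {
    # Non-compliant writer, no conversion on read: the name round-trips.
    "non-compliant writer": round_trip("item", False, False),
    # Compliant writer only: readers now see a different child name.
    "compliant writer only": round_trip("item", True, False),
    # Compliant writer plus converting reader: the default name round-trips.
    "compliant writer, converting reader": round_trip("item", True, True),
    # File from another compliant producer: code that selected "element"
    # would see a different name once the reader starts converting.
    "external compliant file, converting reader": read_child_name("element", True),
}

for label, name in scenarios.items():
    print(f"{label}: {name}")
```

In this model a custom child name (anything other than "item") would not
survive the compliant round trip either, which is one reason a user-facing
switch to keep the old behaviour may still be needed.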
> [C++][Parquet] Default to compliant nested types in Parquet writer
> ------------------------------------------------------------------
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Parquet
> Reporter: Joris Van den Bossche
> Priority: Major
>
> In C++ there is already a "compliant_nested_types" option (to have list
> columns follow the Parquet specification), and ARROW-11497 exposed this
> option in Python.
> This is still set to False by default, but in the source it says "TODO: At
> some point we should flip this.", and in ARROW-11497 there was also some
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]