[
https://issues.apache.org/jira/browse/ARROW-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268695#comment-17268695
]
Weston Pace commented on ARROW-10438:
-------------------------------------
Although, on further thought, that would make it impossible to create `key=`
style partitions. That seems OK, but in the unlucky event that some other
system expects `key=` style partitions to exist it would be pretty
frustrating. Also, one small change: I now prefer "empty_fallback" and
"null_fallback" (without the _value suffix) since these are labels and not values.
Another approach could be to introduce a third option "hive_compatibility"
which defaults to True.
||empty_fallback||null_fallback||hive_compatibility||Read null||Write null||Read empty||Write empty||Allows Data Loss||
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|Error|False|
|"_HIVE_DEFAULT_PARTITION_"|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|_HIVE_DEFAULT_PARTITION_|True|
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|False|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|""|""|False|
|"XYZ"|"XYZ"|True|XYZ|XYZ|Can't happen|XYZ|True|
|"XYZ"|"XYZ"|False|Raise error on partition create| | | | |
|"XYZ"|"ABC"|True|Raise error on partition create| | | | |
|"XYZ"|"ABC"|False|ABC|ABC|XYZ|XYZ|False|
|"XYZ"|""|False|""|""|XYZ|XYZ|False|
|"" (default)|"XYZ"|True|XYZ|XYZ|Can't happen|Error|False|
Docstrings for the three options could look something like...
empty_fallback - Arrow will use this label when the value is empty. If
hive_compatibility is True and this is left at its default, writing an empty
string raises an exception to prevent data loss. If you would like to maintain
Hive interoperability with empty strings, set this to the same value as
null_fallback.
null_fallback - Arrow will use this label when the value is null. By default,
for legacy reasons, this is "_HIVE_DEFAULT_PARTITION_".
hive_compatibility - When this is True, Arrow will not allow a separate fallback
value for empty strings, and writing empty strings will produce an error. If you
wish to silently map empty strings to null (normal Hive behavior) then you
should also set empty_fallback to match null_fallback. If False, then Arrow
will require the empty fallback and null fallback to be separate values.
This all sounds complicated but it might "just work". The customer probably
won't even be aware of the options until they attempt to write data with empty
strings and then they will get an error. At that point they can agree to the
data loss by changing "empty_fallback" or they can agree to breaking with Hive
by disabling "hive_compatibility".
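
To make the proposal concrete, here is a rough Python sketch of how the three options could interact on the write path, following the docstrings above. It is only an illustration: `PartitionLabels`, `write_label`, and `allows_data_loss` are hypothetical names, not existing Arrow APIs, and the default label is spelled the way it is in this comment.

```python
from typing import Optional

# Default null label, spelled as in this comment (an assumption for the sketch).
HIVE_DEFAULT = "_HIVE_DEFAULT_PARTITION_"


class PartitionLabels:
    """Hypothetical holder for the three proposed options (not an Arrow API)."""

    def __init__(self, empty_fallback: str = "",
                 null_fallback: str = HIVE_DEFAULT,
                 hive_compatibility: bool = True):
        if hive_compatibility:
            # Hive mode: no separate label for empty strings. The option is
            # either left at its default ("") or set equal to null_fallback.
            if empty_fallback not in ("", null_fallback):
                raise ValueError(
                    "hive_compatibility=True does not allow a separate empty_fallback")
        elif empty_fallback == null_fallback:
            # Non-Hive mode: the two labels must be distinguishable on read.
            raise ValueError(
                "hive_compatibility=False requires empty_fallback != null_fallback")
        self.empty_fallback = empty_fallback
        self.null_fallback = null_fallback
        self.hive_compatibility = hive_compatibility

    @property
    def allows_data_loss(self) -> bool:
        # Data is lost only when empty strings are silently folded into null.
        return self.hive_compatibility and self.empty_fallback == self.null_fallback

    def write_label(self, value: Optional[str]) -> str:
        """Return the directory label used for one partition value."""
        if value is None:
            return self.null_fallback
        if value == "":
            if self.hive_compatibility and not self.allows_data_loss:
                raise ValueError(
                    "empty string partition value: set empty_fallback to match "
                    "null_fallback or disable hive_compatibility")
            return self.empty_fallback
        return value


# Default options: nulls are written, empty strings are refused.
defaults = PartitionLabels()
print(defaults.write_label(None))
try:
    defaults.write_label("")
except ValueError as exc:
    print("error:", exc)

# Option 1: accept the data loss (empty strings become nulls, as in Hive).
lossy = PartitionLabels(empty_fallback=HIVE_DEFAULT)
print(lossy.write_label(""), lossy.allows_data_loss)

# Option 2: break with Hive and keep `key=` style partitions for empty strings.
non_hive = PartitionLabels(hive_compatibility=False)
print(repr(non_hive.write_label("")), non_hive.allows_data_loss)
```

The two calls at the end correspond to the two choices a user has after hitting the error: opt in to the data loss, or opt out of Hive compatibility.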
> [C++][Dataset] Partitioning::Format on nulls
> --------------------------------------------
>
> Key: ARROW-10438
> URL: https://issues.apache.org/jira/browse/ARROW-10438
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 2.0.0
> Reporter: Ben Kietzman
> Assignee: Weston Pace
> Priority: Major
> Fix For: 4.0.0
>
>
> Writing a dataset with null partition keys is currently untested. Ensure the
> behavior is documented and correct