[ 
https://issues.apache.org/jira/browse/ARROW-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268695#comment-17268695
 ] 

Weston Pace commented on ARROW-10438:
-------------------------------------

Although, on further thought, that would make it impossible to create `key=` 
style partitions.  That would seem ok, but in the unlucky event that some other 
system expects `key=` style partitions to exist it would be pretty 
frustrating.  Also, one small change: I'm preferring "empty_fallback" and 
"null_fallback" (without the "_value" suffix) since these are labels and not 
values.

Another approach could be to introduce a third option "hive_compatibility" 
which defaults to True.

 
||empty_fallback||null_fallback||hive_compatibility||Read null||Write null||Read empty||Write empty||Allows Data Loss||
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|Error|False|
|"_HIVE_DEFAULT_PARTITION_"|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|_HIVE_DEFAULT_PARTITION_|True|
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|False|_HIVE_DEFAULT_PARTITION_|_HIVE_DEFAULT_PARTITION_|""|""|False|
|"XYZ"|"XYZ"|True|XYZ|XYZ|Can't happen|XYZ|True|
|"XYZ"|"XYZ"|False|Raise error on partition create| | | | |
|"XYZ"|"ABC"|True|Raise error on partition create| | | | |
|"XYZ"|"ABC"|False|ABC|ABC|XYZ|XYZ|False|
|"XYZ"|""|False|""|""|XYZ|XYZ|False|
|"" (default)|"XYZ"|True|XYZ|XYZ|Can't happen|Error|False|

Docstrings for the three options could look something like...

 

empty_fallback - Arrow will use this label when the value is an empty string.  
If hive_compatibility is True then the default ("") raises an exception when an 
empty string is written, to prevent data loss.  If you would like to maintain 
Hive interoperability with empty strings, set this to the same value as 
null_fallback.

null_fallback - Arrow will use this label when the value is null.  By default, 
for legacy reasons, this is "_HIVE_DEFAULT_PARTITION_".

hive_compatibility - When this is True, Arrow will not allow a separate 
fallback label for empty strings, and writing empty strings will produce an 
error.  If you wish to silently map empty strings to null (normal Hive 
behavior) then you should also set empty_fallback to match null_fallback.  If 
False, then Arrow will require empty_fallback and null_fallback to be 
different values.
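
To make the rules in these docstrings concrete, here is a rough Python sketch 
of how the three options could interact.  Nothing here is an existing Arrow 
API: the class name PartitionLabels and the methods format_label and 
parse_label are hypothetical, and the sketch assumes that an empty_fallback of 
"" under hive_compatibility means "no label configured, so writing an empty 
string is an error".

```python
from dataclasses import dataclass
from typing import Optional

HIVE_DEFAULT = "_HIVE_DEFAULT_PARTITION_"


@dataclass(frozen=True)
class PartitionLabels:
    """Hypothetical holder for the three proposed options (not an Arrow API)."""

    empty_fallback: str = ""
    null_fallback: str = HIVE_DEFAULT
    hive_compatibility: bool = True

    def __post_init__(self):
        same = self.empty_fallback == self.null_fallback
        # Hive mode: empty strings either error ("") or collapse into null.
        if self.hive_compatibility and not same and self.empty_fallback != "":
            raise ValueError(
                "hive_compatibility=True requires empty_fallback to be '' "
                "or equal to null_fallback"
            )
        # Non-Hive mode: the two labels must be distinguishable on read.
        if not self.hive_compatibility and same:
            raise ValueError(
                "hive_compatibility=False requires empty_fallback and "
                "null_fallback to differ"
            )

    @property
    def allows_data_loss(self) -> bool:
        # Empty strings are silently written as (and read back as) null.
        return self.hive_compatibility and self.empty_fallback == self.null_fallback

    def format_label(self, value: Optional[str]) -> str:
        """Return the directory label to write for a partition value."""
        if value is None:
            return self.null_fallback
        if value == "":
            if not self.hive_compatibility:
                return self.empty_fallback
            if self.empty_fallback == self.null_fallback:
                return self.null_fallback
            raise ValueError(
                "cannot write an empty string: set empty_fallback to match "
                "null_fallback, or disable hive_compatibility"
            )
        return value

    def parse_label(self, label: str) -> Optional[str]:
        """Return the partition value that a directory label represents."""
        if label == self.null_fallback:
            return None
        if not self.hive_compatibility and label == self.empty_fallback:
            return ""
        return label
```

With the defaults, format_label(None) gives the Hive label and 
format_label("") raises.  Passing empty_fallback=HIVE_DEFAULT accepts the data 
loss, while hive_compatibility=False writes empty strings as `key=`.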

 

This all sounds complicated but it might "just work".  The user probably 
won't even be aware of the options until they attempt to write data with empty 
strings, at which point they will get an error.  They can then accept the 
data loss by setting "empty_fallback" to match "null_fallback", or they can 
accept breaking with Hive by disabling "hive_compatibility".

 

> [C++][Dataset] Partitioning::Format on nulls
> --------------------------------------------
>
>                 Key: ARROW-10438
>                 URL: https://issues.apache.org/jira/browse/ARROW-10438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Writing a dataset with null partition keys is currently untested. Ensure the 
> behavior is documented and correct



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
