[ https://issues.apache.org/jira/browse/ARROW-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-15406:
--------------------------------
    Description: 
Currently the default for reading datasets is to do no partitioning.  So given 
the dataset:

/foo=1/part0.parquet
/foo=2/part0.parquet

it will not detect the "foo" partition.  Changing the default to hive should be 
harmless in most cases (the only way it could be a problem is if a user had x=y 
in their directory name and it wasn't intended to be a partition).
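
To make the detection rule and the x=y caveat concrete, here is a minimal 
pure-Python sketch of hive-style path parsing. It is only an illustration, 
not Arrow's implementation: the helper name parse_hive_path and the "/runs" 
directory are hypothetical, and values are kept as plain strings with no 
type inference.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Return {key: value} for every `key=value` directory segment in path.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    partitions = {}
    # Only directory components are inspected; the file name is skipped.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# The dataset from this issue: "foo" is picked up as a partition key.
files = ["/foo=1/part0.parquet", "/foo=2/part0.parquet"]
detected = [parse_hive_path(f) for f in files]

# The caveat: a directory that merely happens to be named x=y is also
# treated as a partition, whether or not the user intended that.
accidental = parse_hive_path("/runs/x=y/part0.parquet")
```

With no partitioning (today's default) none of these keys would be 
reported; under a hive default, "foo" appears as a column, and so does any 
unintended x=y directory.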

This may put us at odds with the default partitioning for writes (ARROW-15407), 
but specifying "partitioning=hive" on a directory-partitioned dataset is no 
worse than specifying "partitioning=None" on a directory-partitioned dataset, 
which is what we do today.

  was:
Currently the default for reading datasets is to do no partitioning.  So given 
the dataset:

/foo=1/part0.parquet
/foo=2/part0.parquet

it will not detect the "foo" partition.  Changing the default to hive should be 
harmless in most cases (the only way it could be a problem is if a user had x=y 
in their directory name and it wasn't intended to be a partition).

This may put us at odds with the default partitioning for writes (I'm opening a 
separate JIRA for that) but specifying "partitioning=hive" on a directory 
partitioned dataset is no worse than specifying "partitioning=None" on a  
directory partitioned dataset which is what we do today.


> [Python] Change the default read partitioning flavor to hive
> ------------------------------------------------------------
>
>                 Key: ARROW-15406
>                 URL: https://issues.apache.org/jira/browse/ARROW-15406
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Weston Pace
>            Priority: Major
>
> Currently the default for reading datasets is to do no partitioning.  So 
> given the dataset:
> /foo=1/part0.parquet
> /foo=2/part0.parquet
> it will not detect the "foo" partition.  Changing the default to hive should 
> be harmless in most cases (the only way it could be a problem is if a user 
> had x=y in their directory name and it wasn't intended to be a partition).
> This may put us at odds with the default partitioning for writes 
> (ARROW-15407), but specifying "partitioning=hive" on a directory-partitioned 
> dataset is no worse than specifying "partitioning=None" on a 
> directory-partitioned dataset, which is what we do today.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
