[
https://issues.apache.org/jira/browse/ARROW-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474912#comment-17474912
]
Weston Pace commented on ARROW-15310:
-------------------------------------
So I'd vote we do something similar to what I proposed in ARROW-10485. In
Python this would mean adding a {{partitioning_flavor}} option to
{{pyarrow.dataset.dataset}} and defaulting it to {{"hive"}}.
If set to "hive", the Python code would supply a
{{HivePartitioningFactory}} to the {{FileSystemDatasetFactory}}. This would
basically be the analogue of the {{partitioning_flavor}} option in
{{pyarrow.dataset.write_dataset}}.
If both {{partitioning}} and {{partitioning_flavor}} are set, then
{{partitioning}} would take precedence (although an error would be acceptable).
This should make it clear to the user (see the sketch after this list):
* By default we will try to interpret directories as hive-partitioned folders.
* If you don't want that default, there is an easy way to change it.
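A rough sketch of how that proposed option might look from the Python side. Note that {{partitioning_flavor}} on {{pyarrow.dataset.dataset}} does not exist today; the name and behavior here are only the proposal above, not a released API:
{code:python}
import pyarrow.dataset as ds

# Hypothetical: "partitioning_flavor" is the option proposed above, not an
# existing pyarrow keyword. With the proposed default of "hive", directories
# like "part=a/" would be inferred as hive partitions automatically.
dataset = ds.dataset("dataset_wrong_partitioning", partitioning_flavor="hive")

# Opting out of the default would then be explicit:
dataset = ds.dataset("dataset_wrong_partitioning", partitioning_flavor=None)
{code}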
> [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is
> parsing an actually hive-style file path?
> -----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15310
> URL: https://issues.apache.org/jira/browse/ARROW-15310
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> When you have a hive-style partitioned dataset, with our current
> {{dataset(..)}} API, it's relatively easy to mess up the inferred
> partitioning and get confusing results.
> For example, if you specify the partitioning field names with
> {{partitioning=[...]}} (which is not needed for hive style since those are
> inferred), we actually assume you want directory partitioning. This
> DirectoryPartitioning will then parse the hive-style file paths and take the
> full "key=value" as the data values for the field.
> And then, doing a filter can result in a confusing empty result (because
> "value" doesn't match "key=value").
> I am wondering if we can't relatively cheaply detect this case and, e.g., give
> an informative warning about it to the user.
> Basically what happens is this:
> {code:python}
> >>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
> >>> part.parse("part=a")
> <pyarrow.dataset.Expression (part == "part=a")>
> {code}
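> For comparison, the hive flavour parses the same path segment into the intended key/value pair (output sketched from memory rather than copied from a run):
> {code:python}
> >>> part = ds.HivePartitioning(pa.schema([("part", "string")]))
> >>> part.parse("part=a")
> <pyarrow.dataset.Expression (part == "a")>
> {code}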
> If the parsed value is a string that contains a "=" (and in this case also
> contains the field name), that is, I think, a clear sign that in the large
> majority of cases the user is doing something wrong.
> I am not fully sure where and at what stage the check could be done though.
> Doing it for every path in the dataset might be too costly.
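> One cheap heuristic, sketched here purely as an illustration (the helper name and where it would hook in are made up, not an existing API), would be to check only the first parsed path:
> {code:python}
> import warnings
>
> def warn_if_hive_like(field_name, parsed_value):
>     # Hypothetical helper: flag parsed values that still look like
>     # "key=value", which suggests DirectoryPartitioning was applied
>     # to hive-style paths.
>     if isinstance(parsed_value, str) and parsed_value.startswith(field_name + "="):
>         warnings.warn(
>             f"Partition field '{field_name}' has value '{parsed_value}'; "
>             "the path looks hive-style, consider partitioning='hive'."
>         )
>
> warn_if_hive_like("part", "part=a")  # emits the warning
> {code}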
> ----
> Illustrative code example:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> import pathlib
> ## constructing a small dataset with 1 hive-style partitioning level
> basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
> basedir.mkdir(exist_ok=True)
> (basedir / "part=a").mkdir(exist_ok=True)
> (basedir / "part=b").mkdir(exist_ok=True)
> table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
> pq.write_table(table1, basedir / "part=a" / "data.parquet")
> table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
> pq.write_table(table2, basedir / "part=b" / "data.parquet")
> {code}
> Reading it as-is (not specifying a partitioning, so defaulting to no
> partitioning) will at least give an error about a missing field:
> {code:python}
> >>> dataset = ds.dataset(basedir)
> >>> dataset.to_table(filter=ds.field("part") == "a")
> ...
> ArrowInvalid: No match for FieldRef.Name(part) in a: int64
> {code}
> But specifying the partitioning field name (which currently gets (silently)
> interpreted as directory partitioning) gives a confusing empty result:
> {code:python}
> >>> dataset = ds.dataset(basedir, partitioning=["part"])
> >>> dataset.to_table(filter=ds.field("part") == "a")
> pyarrow.Table
> a: int64
> b: int64
> part: string
> ----
> a: []
> b: []
> part: []
> {code}
> This filter doesn't work because the values in the "part" column are not "a"
> but "part=a":
> {code:python}
> >>> dataset.to_table().to_pandas()
> a b part
> 0 1 1 part=a
> 1 2 2 part=a
> 2 3 3 part=a
> 3 4 1 part=b
> 4 5 2 part=b
> 5 6 3 part=b
> {code}
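> For contrast, passing {{partitioning="hive"}} gives the intended values (expected output sketched, not copied from an actual run):
> {code:python}
> >>> dataset = ds.dataset(basedir, partitioning="hive")
> >>> dataset.to_table(filter=ds.field("part") == "a").to_pandas()
>    a  b part
> 0  1  1    a
> 1  2  2    a
> 2  3  3    a
> {code}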