[
https://issues.apache.org/jira/browse/ARROW-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474912#comment-17474912
]
Weston Pace commented on ARROW-15310:
-------------------------------------
So I'd vote we do something similar to what I proposed in ARROW-10485. In
Python this would mean adding a {{partitioning_flavor}} option to
{{pyarrow.dataset.dataset}} and defaulting it to {{"hive"}}.
If set to "hive", the Python code would supply a
{{HivePartitioningFactory}} to the {{FileSystemDatasetFactory}}. This would
basically be the analogue of the {{partitioning_flavor}} option in
{{pyarrow.dataset.write_dataset}}.
If both {{partitioning}} and {{partitioning_flavor}} are set, then
{{partitioning}} would take precedence (although an error would be acceptable).
This should make it clear to the user (see the sketch after this list):
* By default we will try to interpret directories as hive-partitioned folders.
* If you don't want that default, there is an easy way to change it.
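A rough sketch of how that proposed option might look from the Python side. Note that {{partitioning_flavor}} on {{pyarrow.dataset.dataset}} does not exist today; the name and behavior here are only the proposal above, not a released API:
{code:python}
import pyarrow.dataset as ds

# Hypothetical: "partitioning_flavor" is the option proposed above, not an
# existing pyarrow keyword. With the proposed default of "hive", directories
# like "part=a/" would be inferred as hive partitions automatically.
dataset = ds.dataset("dataset_wrong_partitioning", partitioning_flavor="hive")

# Opting out of the default would then be explicit:
dataset = ds.dataset("dataset_wrong_partitioning", partitioning_flavor=None)
{code}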
> [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is
> parsing an actually hive-style file path?
> -----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15310
> URL: https://issues.apache.org/jira/browse/ARROW-15310
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> When you have a hive-style partitioned dataset, with our current
> {{dataset(..)}} API, it's relatively easy to mess up the inferred
> partitioning and get confusing results.
> For example, if you specify the partitioning field names with
> {{partitioning=[...]}} (which is not needed for hive style since those are
> inferred), we actually assume you want directory partitioning. This
> DirectoryPartitioning will then parse the hive-style file paths and take the
> full "key=value" as the data values for the field.
> And then, doing a filter can result in a confusing empty result (because
> "value" doesn't match "key=value").
> I am wondering if we can't relatively cheaply detect this case and, e.g., give
> an informative warning about it to the user.
> Basically what happens is this:
> {code:python}
> >>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
> >>> part.parse("part=a")
> <pyarrow.dataset.Expression (part == "part=a")>
> {code}
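> For comparison, the hive flavour parses the same path segment into the intended key/value pair (output sketched from memory rather than copied from a run):
> {code:python}
> >>> part = ds.HivePartitioning(pa.schema([("part", "string")]))
> >>> part.parse("part=a")
> <pyarrow.dataset.Expression (part == "a")>
> {code}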
> If the parsed value is a string that contains a "=" (and in this case also
> contains the field name), that is, I think, a clear sign that in the large
> majority of cases the user is doing something wrong.
> I am not fully sure where and at what stage the check could be done though.
> Doing it for every path in the dataset might be too costly.
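> One cheap heuristic, sketched here purely as an illustration (the helper name and where it would hook in are made up, not an existing API), would be to check only the first parsed path:
> {code:python}
> import warnings
>
> def warn_if_hive_like(field_name, parsed_value):
>     # Hypothetical helper: flag parsed values that still look like
>     # "key=value", which suggests DirectoryPartitioning was applied
>     # to hive-style paths.
>     if isinstance(parsed_value, str) and parsed_value.startswith(field_name + "="):
>         warnings.warn(
>             f"Partition field '{field_name}' has value '{parsed_value}'; "
>             "the path looks hive-style, consider partitioning='hive'."
>         )
>
> warn_if_hive_like("part", "part=a")  # emits the warning
> {code}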
> ----
> Illustrative code example:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> import pathlib
> ## constructing a small dataset with 1 hive-style partitioning level
> basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
> basedir.mkdir(exist_ok=True)
> (basedir / "part=a").mkdir(exist_ok=True)
> (basedir / "part=b").mkdir(exist_ok=True)
> table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
> pq.write_table(table1, basedir / "part=a" / "data.parquet")
> table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
> pq.write_table(table2, basedir / "part=b" / "data.parquet")
> {code}
> Reading it as-is (not specifying a partitioning, so defaulting to no
> partitioning) will at least give an error about a missing field:
> {code:python}
> >>> dataset = ds.dataset(basedir)
> >>> dataset.to_table(filter=ds.field("part") == "a")
> ...
> ArrowInvalid: No match for FieldRef.Name(part) in a: int64
> {code}
> But specifying the partitioning field name (which currently gets (silently)
> interpreted as directory partitioning) gives a confusing empty result:
> {code:python}
> >>> dataset = ds.dataset(basedir, partitioning=["part"])
> >>> dataset.to_table(filter=ds.field("part") == "a")
> pyarrow.Table
> a: int64
> b: int64
> part: string
> ----
> a: []
> b: []
> part: []
> {code}
> This filter doesn't work because the values in the "part" column are not "a"
> but "part=a":
> {code:python}
> >>> dataset.to_table().to_pandas()
> a b part
> 0 1 1 part=a
> 1 2 2 part=a
> 2 3 3 part=a
> 3 4 1 part=b
> 4 5 2 part=b
> 5 6 3 part=b
> {code}
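> For contrast, passing {{partitioning="hive"}} gives the intended values (expected output sketched, not copied from an actual run):
> {code:python}
> >>> dataset = ds.dataset(basedir, partitioning="hive")
> >>> dataset.to_table(filter=ds.field("part") == "a").to_pandas()
>    a  b part
> 0  1  1    a
> 1  2  2    a
> 2  3  3    a
> {code}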