Joris Van den Bossche created ARROW-15310:
---------------------------------------------
Summary: [C++][Python][Dataset] Detect (and warn?) when
DirectoryPartitioning is parsing an actually hive-style file path?
Key: ARROW-15310
URL: https://issues.apache.org/jira/browse/ARROW-15310
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Joris Van den Bossche
When you have a hive-style partitioned dataset, with our current
{{dataset(..)}} API, it's relatively easy to mess up the inferred partitioning
and get confusing results.
For example, if you specify the partitioning field names with
{{partitioning=[...]}} (which is not needed for hive style since those are
inferred), we actually assume you want directory partitioning. This
DirectoryPartitioning will then parse the hive-style file paths and take the
full "key=value" as the data values for the field.
And then, doing a filter can result in a confusing empty result (because
"value" doesn't match "key=value").
I am wondering if we can't relatively cheaply detect this case and, e.g., give
an informative warning to the user.
Basically what happens is this:
{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>
{code}
If the parsed value is a string that contains a "=" (and, as in this case,
also contains the field name), that is, I think, a clear sign that in the
large majority of cases the user is doing something wrong.
I am not fully sure where and at what stage the check could be done though.
Doing it for every path in the dataset might be too costly.
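As a sketch of what such a check could look like (purely illustrative Python; the name {{looks_like_hive_segment}} is hypothetical, and the actual check would presumably live in the C++ path parser):

{code:python}
def looks_like_hive_segment(field_name, raw_value):
    # A DirectoryPartitioning value that itself parses as "key=value",
    # with the key equal to the field name, almost certainly means the
    # dataset is actually hive-partitioned.
    key, sep, _ = raw_value.partition("=")
    return sep == "=" and key == field_name

assert not looks_like_hive_segment("part", "a")   # normal directory value
assert looks_like_hive_segment("part", "part=a")  # mis-parsed hive path
{code}

To keep the cost down, such a check could be applied only to the first path seen during discovery, rather than to every file in the dataset.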
----
Illustrative code example:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib
# construct a small dataset with 1 hive-style partitioning level
basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)
(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)
table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")
table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}
Reading it as-is (not specifying a partitioning, so defaulting to no
partitioning) will at least give an error about a missing field:
{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}
But specifying the partitioning field name (which currently gets silently
interpreted as directory partitioning) gives a confusing empty result:
{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []
{code}
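For comparison, passing {{partitioning="hive"}} parses the "key=value" directories correctly. A self-contained variant of the example above (writing the same small dataset to a temporary directory) then behaves as expected:

{code:python}
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# recreate the small hive-partitioned dataset in a temporary directory
basedir = pathlib.Path(tempfile.mkdtemp())
(basedir / "part=a").mkdir()
(basedir / "part=b").mkdir()
pq.write_table(pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]}),
               basedir / "part=a" / "data.parquet")
pq.write_table(pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]}),
               basedir / "part=b" / "data.parquet")

# with hive partitioning, "part=a" is parsed into part == "a",
# so the filter matches the three rows of the first directory
dataset = ds.dataset(basedir, partitioning="hive")
result = dataset.to_table(filter=ds.field("part") == "a")
assert result.num_rows == 3
assert result.column("part").to_pylist() == ["a", "a", "a"]
{code}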
--
This message was sent by Atlassian Jira
(v8.20.1#820001)