jorisvandenbossche commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r528586857
##########
File path: python/pyarrow/parquet.py
##########
@@ -877,17 +894,22 @@ def filter_accepts_partition(self, part_key, filter,
level):
f_type = type(f_value)
- if isinstance(f_value, set):
+ if op in {'in', 'not in'}:
+ if not (isinstance(f_value, Container) and
+ isinstance(f_value, Iterable)):
Review comment:
Could maybe also check for `Collection` instead of the combination of
Container and Iterable?
##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2009,6 +2021,37 @@ def test_filters_inclusive_set(tempdir,
use_legacy_dataset):
assert False not in result_df['boolean'].values
[email protected]
+@parametrize_legacy_dataset
+def test_filters_no_partition_found(tempdir, use_legacy_dataset):
+ fs = LocalFileSystem._get_instance()
+ base_path = tempdir
+
+ integer_keys = [0, 1]
+ string_keys = ['a', 'b', 'c']
+ boolean_keys = [True, False]
+ partition_spec = [
+ ['integer', integer_keys],
+ ['string', string_keys],
+ ['boolean', boolean_keys]
+ ]
+
+ df = pd.DataFrame({
+ 'integer': np.array(integer_keys, dtype='i4').repeat(15),
+ 'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
+ 'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
+ 3),
+ }, columns=['integer', 'string', 'boolean'])
+
+ _generate_partition_directories(fs, base_path, partition_spec, df)
+
+ with pytest.raises(ValueError, match=r"No partition is found."):
Review comment:
This only raises an error for the legacy dataset, for the new
implementation, we return an empty table. So you will need to add a `if
use_legacy_dataset: ...` here (see a similar case in the
`test_filters_invalid_pred_op` test just below)
##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -1997,8 +1997,20 @@ def test_filters_inclusive_set(tempdir,
use_legacy_dataset):
dataset = pq.ParquetDataset(
base_path, filesystem=fs,
- filters=[('integer', 'in', {1}), ('string', 'in', {'a', 'b'}),
- ('boolean', 'in', {True})],
+ filters=[('string', 'in', 'ab')],
Review comment:
I am wondering, should we raise an error here instead? Because now this
is interpreted as `("string", "in", {"a", "b"})`, which I am not sure we should
"silently" do (it's not very explicit for the user)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]