ldacey commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1317139268
> > I am actually not fully sure if the code to evaluate pushdown filters
> > would actually understand an isin kernel. I think this is handled in
> > SimplifyWithGuarantee:
>
I have historically consolidated the filter into a single .isin() expression
when my dataset has one partition column, since I assumed it would be
faster/better.
```
Consolidated dataset filter: {'month_id': [202105, 202106, 202107]}
<pyarrow.compute.Expression is_in(month_id, {value_set=int64:[
  202105,
  202106,
  202107
], skip_nulls=false})>
```
But if it is fine to use OR (and potentially have duplicate filters in cases
where multiple fragments were saved to 1 partition), then I am fine with that.
It sounds like a chain of OR filters might be used anyways.
I think we can close this then. The reduce method above will cut out a lot
of my custom, messy code. I currently have several functions preparing the
expression for me (the consolidate_dictionary() removes the duplicate
partitions from ds._get_partition_keys).
```
def consolidate_expressions(
    self, expressions: list[ds.Expression], partition_count: int
) -> dict | list[dict]:
    """Consolidates the values of multiple filters into a single list

    Args:
        expressions: Partitioning expressions
        partition_count: Number of partition columns

    Returns:
        Consolidated dictionary filter such as {'date_id': [20220507,
        20220514]}
    """
    filters = [ds._get_partition_keys(exp) for exp in expressions]
    if partition_count == 1:
        filters = consolidate_dictionary(filters)
    return filters

@staticmethod
def multiple_partition_filter(
    partitions: list[dict] | dict,
) -> list[list[tuple[str, str, Any]]]:
    """Consolidates filters into a single list which has lists of tuples
    such as
    [[("date_id", "==", 20201231)], [("date_id", "==", 20210217)]]

    Args:
        partitions: Dataset partition keys
    """
    filters = []
    for part in partitions:
        element = [(k, "==", v) for k, v in part.items()]
        if element not in filters:
            filters.append(element)
    return filters

@staticmethod
def single_partition_filter(
    partitions: list[dict] | dict,
) -> list[tuple[str, str, Any]]:
    """Consolidates filters into a single list of a tuple such as
    [("date_id", "in", [20201231, 20200101, 20200102, 20200103])]

    Args:
        partitions: Dataset partition keys
    """
    return [(k, "in", v) for k, v in partitions.items()]
```
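
To make the data flow concrete, here is a rough, pure-Python sketch of how these helpers fit together. The body of `consolidate_dictionary()` is not shown above, so the version here is a hypothetical stand-in inferred from the docstring (merge per-fragment keys into one dict of unique values). The two filter helpers are restated as plain functions so the snippet runs on its own, and the sample keys are made up.

```python
from __future__ import annotations

from typing import Any


def consolidate_dictionary(filters: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical stand-in: merge partition keys into one dict of unique values."""
    merged: dict[str, list[Any]] = {}
    for part in filters:
        for key, value in part.items():
            values = merged.setdefault(key, [])
            if value not in values:  # drop duplicates, keep first-seen order
                values.append(value)
    return merged


def single_partition_filter(partitions: dict) -> list[tuple[str, str, Any]]:
    # One ("column", "in", [values]) tuple per partition column.
    return [(k, "in", v) for k, v in partitions.items()]


def multiple_partition_filter(partitions: list[dict]) -> list[list[tuple[str, str, Any]]]:
    # One list of ("column", "==", value) tuples per distinct partition.
    filters = []
    for part in partitions:
        element = [(k, "==", v) for k, v in part.items()]
        if element not in filters:
            filters.append(element)
    return filters


# Made-up partition keys, with one duplicate as when two fragments share a partition.
keys = [{"date_id": 20220507}, {"date_id": 20220514}, {"date_id": 20220507}]

single = single_partition_filter(consolidate_dictionary(keys))
multi = multiple_partition_filter(keys)

print(single)  # [('date_id', 'in', [20220507, 20220514])]
print(multi)   # [[('date_id', '==', 20220507)], [('date_id', '==', 20220514)]]
```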