ldacey commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1317139268

   > > I am actually not fully sure if the code to evaluate pushdown filters would actually understand an isin kernel. I think this is handled in SimplifyWithGuarantee:
   > 
   
   I have historically consolidated filters into a single .isin() expression when my dataset has a single partition column, since I assumed it would be faster.
   
   ```
   Consolidated dataset filter: {'month_id': [202105, 202106, 202107]}
   
   <pyarrow.compute.Expression is_in(month_id, {value_set=int64:[
     202105,
     202106,
     202107
   ], skip_nulls=false})>
   ```
   
   But if it is fine to use OR (and potentially have duplicate filters in cases where multiple fragments were saved to one partition), then I am fine with that. It sounds like a chain of OR filters might be used anyway.
   
   I think we can close this then. The reduce method above will cut out a lot of my custom, messy code. I currently have several functions preparing the expression for me (consolidate_dictionary() removes the duplicate partitions returned by ds._get_partition_keys).
   
   ```
        # Assumes surrounding class context plus:
        #   from typing import Any
        #   import pyarrow.dataset as ds
        def consolidate_expressions(
            self, expressions: list[ds.Expression], partition_count: int
        ) -> dict | list[dict]:
            """Consolidates the values of multiple filters into a single list
   
           Args:
               expressions: Partitioning expressions
               partition_count: Number of partition columns
   
           Returns:
                Consolidated dictionary filter such as
                {'date_id': [20220507, 20220514]}
            """
           filters = [ds._get_partition_keys(exp) for exp in expressions]
           if partition_count == 1:
               filters = consolidate_dictionary(filters)
           return filters
   
       @staticmethod
        def multiple_partition_filter(
            partitions: list[dict],
        ) -> list[list[tuple[str, str, Any]]]:
            """Consolidates filters into a list of lists of tuples such as
                [[("date_id", "==", 20201231)], [("date_id", "==", 20210217)]]
   
           Args:
               partitions: Dataset partition keys
           """
           filters = []
           for part in partitions:
               element = [(k, "==", v) for k, v in part.items()]
               if element not in filters:
                   filters.append(element)
           return filters
   
       @staticmethod
        def single_partition_filter(
            partitions: dict[str, list],
        ) -> list[tuple[str, str, Any]]:
            """Consolidates filters into a single list with one tuple per
            partition column, such as
                [("date_id", "in", [20201231, 20200101, 20200102, 20200103])]
   
           Args:
               partitions: Dataset partition keys
           """
           return [(k, "in", v) for k, v in partitions.items()]
   ```
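Since `consolidate_dictionary` is not shown above, here is a self-contained sketch of what I assume it does, together with the two filter shapes the helpers produce (my reconstruction, not the actual implementation):

```python
from typing import Any


def consolidate_dictionary(filters: list[dict]) -> dict[str, list]:
    # Merge single-column partition dicts into one column -> values dict,
    # dropping duplicate partition values along the way.
    consolidated: dict[str, list] = {}
    for part in filters:
        for key, value in part.items():
            values = consolidated.setdefault(key, [])
            if value not in values:
                values.append(value)
    return consolidated


partitions = [
    {"month_id": 202105},
    {"month_id": 202106},
    {"month_id": 202106},  # duplicate: two fragments in one partition
    {"month_id": 202107},
]

consolidated = consolidate_dictionary(partitions)
print(consolidated)  # {'month_id': [202105, 202106, 202107]}

# Single-partition ("in") filter shape
in_filter = [(k, "in", v) for k, v in consolidated.items()]
print(in_filter)

# Multi-partition (DNF/OR) filter shape, duplicates removed
or_filters: list[list[tuple[str, str, Any]]] = []
for part in partitions:
    element = [(k, "==", v) for k, v in part.items()]
    if element not in or_filters:
        or_filters.append(element)
print(or_filters)
```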

