[ 
https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao reassigned ARROW-11566:
------------------------------------

    Assignee: Weiyang Zhao

> [Python][Parquet] Use pypi condition package to filter partitions in a user 
> friendly way
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-11566
>                 URL: https://issues.apache.org/jira/browse/ARROW-11566
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Weiyang Zhao
>            Assignee: Weiyang Zhao
>            Priority: Major
>
> I created the condition package (available on PyPI) to allow user-friendly 
> expression of conditions. For example, a condition can be:
> (A <= 3 or B != 'b1') and C == ['c1', 'c2'] 
> For usage details, please see its documentation at: 
> [https://condition.readthedocs.io/en/latest/usage.html]
>  
> Arbitrary condition objects can be converted to a pyarrow filter by calling 
> their to_pyarrow_filter() method:
> [https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]
> This method normalizes the condition so that it conforms to the pyarrow 
> filter specification.
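
For reference, pyarrow's `filters` argument expects disjunctive normal form: an outer list of OR-ed groups, each an inner list of AND-ed `(column, op, value)` tuples. The sketch below writes out by hand what a normalized form of the example condition could look like, and checks it with a small evaluator. It does not call the condition package; `matches` is a hypothetical helper, and reading `C == ['c1', 'c2']` as a membership test ("in") is an assumption.

```python
import operator

# Hand-written DNF form of: (A <= 3 or B != 'b1') and C in ['c1', 'c2']
# Outer list = OR of groups; inner list = AND of (column, op, value) tuples.
# Assumption: `C == ['c1', 'c2']` in the condition syntax means membership.
dnf_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Operators accepted in pyarrow filter tuples, mapped to Python callables.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(row, filters):
    """Hypothetical helper: True if `row` (a dict) satisfies the DNF filter."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in filters
    )
```

A list of this shape is what `pyarrow.parquet.read_table(path, filters=...)` accepts; the evaluator is only there to make the OR-of-ANDs semantics explicit.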
>  
> Furthermore, the condition object can be used directly to evaluate partition 
> paths. This could replace the current complex filtering code (both native and 
> Python).
> For maximum efficiency, filtering with the condition object can be done as 
> follows:
>  # read the paths in chunks to keep the memory footprint small;
>  # parse the paths into a pandas DataFrame;
>  # use condition.query(dataframe) to get the filtered DataFrame of paths;
>  # use the numexpr backend for the DataFrame query for efficiency.
> Please discuss.
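
The chunked filtering steps in the description can be sketched with the standard library alone. This is a minimal illustration, not the proposed implementation: it assumes hive-style `key=value` path segments, all function names are hypothetical, and a plain Python predicate stands in for both the condition object and the pandas DataFrame query.

```python
from itertools import islice


def parse_partition_path(path):
    """Parse hive-style 'key=value' segments into a dict (values stay strings)."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def filter_paths(paths, predicate, chunk_size=1000):
    """Yield paths whose parsed partition values satisfy `predicate`.

    Only one chunk of paths is materialized at a time.
    """
    for chunk in chunked(paths, chunk_size):
        for path in chunk:
            if predicate(parse_partition_path(path)):
                yield path


paths = [
    "A=1/B=b1/C=c1/part-0.parquet",
    "A=5/B=b1/C=c1/part-0.parquet",
    "A=5/B=b2/C=c2/part-0.parquet",
    "A=1/B=b2/C=c3/part-0.parquet",
]


def predicate(row):
    # Stand-in for: (A <= 3 or B != 'b1') and C in ['c1', 'c2']
    return (int(row["A"]) <= 3 or row["B"] != "b1") and row["C"] in ("c1", "c2")


selected = list(filter_paths(paths, predicate, chunk_size=2))
```

In the approach described above, each chunk would instead be turned into a pandas DataFrame and filtered in one vectorized call; `DataFrame.query` uses the numexpr engine by default when numexpr is installed.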



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
