[
https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weiyang Zhao reassigned ARROW-11566:
------------------------------------
Assignee: Weiyang Zhao
> [Python][Parquet] Use pypi condition package to filter partitions in a user
> friendly way
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-11566
> URL: https://issues.apache.org/jira/browse/ARROW-11566
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Weiyang Zhao
> Assignee: Weiyang Zhao
> Priority: Major
>
> I created the condition package on PyPI to allow user-friendly expression of
> conditions. For example, a condition can be:
> (A <= 3 or B != 'b1') and C == ['c1', 'c2']
> For usage details, please see its documentation at:
> https://condition.readthedocs.io/en/latest/usage.html
>
> Any condition object can be converted to a pyarrow filter by calling its
> to_pyarrow_filter() method:
> https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> This method normalizes the condition to conform to the pyarrow filter
> specification.
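For reference, pyarrow's filters argument accepts either a flat list of (column, op, value) tuples, which are ANDed together, or a list of such lists, which are ORed (disjunctive normal form). Below is a minimal sketch of what the example condition could look like once distributed into that form. It assumes that equality against a list maps to the "in" operator; the actual output of to_pyarrow_filter() may differ. The matches() helper is hypothetical, written only to make the semantics concrete, and is not part of pyarrow or the condition package.

```python
import operator

# (A <= 3 or B != 'b1') and C == ['c1', 'c2'], distributed into DNF:
# outer list = OR, inner lists = AND, each tuple = (column, op, value).
# Assumption: equality against a list is expressed with the "in" operator.
dnf_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Operators that the pyarrow filter tuples support, mapped to Python callables.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, candidates: value in candidates,
    "not in": lambda value, candidates: value not in candidates,
}


def matches(row, dnf):
    """Hypothetical helper: True if `row` (a dict) satisfies any conjunction."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf
    )
```

A list in this shape is what would be passed as filters= to pyarrow.parquet.read_table(); the helper above only illustrates how such a list is interpreted.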
>
> Furthermore, the condition object can be used directly to evaluate partition
> paths. This could replace the current complex filtering code (both native and
> Python).
> For maximum efficiency, filtering with the condition object can be done as
> follows:
> # read the paths in chunks to keep the memory footprint small;
> # parse the paths into a pandas dataframe;
> # use condition.query(dataframe) to get the filtered dataframe of paths;
> # use the numexpr backend for the dataframe query for efficiency.
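The chunked flow in the steps above can be sketched roughly as follows. This is a plain-Python stand-in rather than the proposed implementation. It assumes hive-style key=value path segments, and it uses an ordinary predicate function in place of the condition object and the pandas/numexpr query. All function names here (parse_partition_path, chunked, filter_paths, example_predicate) are hypothetical.

```python
from itertools import islice


def parse_partition_path(path):
    """Parse hive-style 'key=value' segments; values are kept as strings."""
    parts = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip segments without '=', e.g. the file name
            parts[key] = value
    return parts


def chunked(iterable, size):
    """Yield lists of at most `size` items so only one chunk is in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def filter_paths(paths, predicate, chunk_size=1000):
    """Lazily yield the paths whose parsed partition values satisfy `predicate`."""
    for chunk in chunked(paths, chunk_size):
        for path in chunk:
            if predicate(parse_partition_path(path)):
                yield path


# Illustrative data only; these paths are made up for the sketch.
paths = [
    "A=1/B=b1/C=c1/part-0.parquet",
    "A=5/B=b1/C=c1/part-0.parquet",
    "A=5/B=b2/C=c2/part-0.parquet",
    "A=2/B=b2/C=c3/part-0.parquet",
]


def example_predicate(p):
    # Stand-in for (A <= 3 or B != 'b1') and C == ['c1', 'c2']
    return (int(p["A"]) <= 3 or p["B"] != "b1") and p["C"] in ("c1", "c2")


selected = list(filter_paths(paths, example_predicate, chunk_size=2))
```

With pandas, the parsed dictionaries for each chunk could instead be collected into a dataframe and filtered in one vectorized call, which is what steps 2 to 4 describe; the per-row loop here just keeps the sketch dependency-free.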
> Please discuss.