svjack opened a new issue #9146: URL: https://github.com/apache/arrow/issues/9146
I reviewed the difference between `ParquetDataset` and `_ParquetDatasetV2` in the source code: they use different logic to perform partition filtering. The former simply uses `_filters`, while the latter combines `_filters` with `_filters_to_expression`. I think the latter's design is more useful, because field transformations (such as casts and other column transformations) can be injected into the expression before the filter is actually applied.

My problem is that I can't cast a string into a timestamp in a `ChunkedArray` (the format in which a field or column is actually stored in a table), so I can't use this to simplify some of the logic in my filters. For example, with filters like `[[("backup_time", ">", pd.to_datetime("2020-01-01"))]]`, where `backup_time` is a partition stored as a time string that is not well formatted, I want to override the `_filters_to_expression` function and use `field.cast` to transform the field type from string to timestamp before applying the filter.

Going further, I could register more complex functions in `pyarrow.compute` to define custom calculations on partitions as expressions; that is all I need. With the help of expressions, I want to promote `_filters_to_expression` from comparing only field values to comparing `func(field)`, or even `field -> field_object` comparisons, much like a Spark `udf(field)`.

How can I do this gracefully? It would help read performance when partitions have a complex format.

----------------------------------------------------------------

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
