svjack opened a new issue #9146:
URL: https://github.com/apache/arrow/issues/9146


   I reviewed the difference between `ParquetDataset` and `_ParquetDatasetV2` in the source code: they use different logic to perform partition filtering. The former simply uses `_filters`, while the latter combines `_filters` with `_filters_to_expression`.
   I think the latter design is more useful, because expressions allow field transformations (such as casts and other column transformations) to be injected before the filter is actually applied.
   My problem is that I can't cast from string to timestamp on a `ChunkedArray` (the format in which a field or column is actually stored in a table), so I can't use this approach to simplify some of the logic in my filters.
   For example, with filters such as
   [[("backup_time", ">", pd.to_datetime("2020-01-01")), ]]
   where "backup_time" is a partition whose time string is not well formatted, I want to override the `_filters_to_expression` function and use `field.cast` to transform the field type from string to timestamp before performing the filter.
   Furthermore, I could register more complex functions in `pyarrow.compute` and use them as custom functions in expressions to define many kinds of calculations on partitions; this is all I need.
   With the help of expressions, I want to promote the `_filters_to_expression` function from comparing only field values to comparing `func(field)`, or even `field -> field_object` comparisons; think of it as improving `field` into something like a Spark `udf(field)`.
   How can I do this gracefully? It would help read performance when partitions have a complex format.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

