[GitHub] [arrow] westonpace commented on issue #36302: [Python][Parquet] dataset filter doesn't apply custom schema to parquet file if the file has schema with metadata

via GitHub Wed, 12 Jul 2023 09:34:14 -0700


westonpace commented on issue #36302:
URL: https://github.com/apache/arrow/issues/36302#issuecomment-1632861575


   Yes, the casting will happen after we read the column into memory.  
Something like (this is just pseudocode)...
   
   ```
   # column is a string array
   column = read_from_parquet_file(col_index)
   desired_type = dataset_schema.types[col_index]
   if column.type != desired_type:
     # now column is a timestamp array
     column = cast(column, desired_type)
   ...
   table = build_table_from_columns(...)
   ...
   # filter happens down here
   ```
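
To make the order of operations concrete, here is a small runnable sketch in plain Python. It is only a toy model, not Arrow's actual reader: `raw_column`, `cast_to_timestamp`, and `read_cast_filter` are hypothetical names, and a list of ISO-8601 strings stands in for the string column stored in the Parquet file.

```python
from datetime import datetime

# Hypothetical stand-in for a Parquet column that was written as strings.
raw_column = ["2023-07-01T00:00:00", "2023-07-10T12:30:00", "2023-07-12T09:34:14"]


def cast_to_timestamp(values):
    """Cast ISO-8601 strings to datetime objects (the 'desired type')."""
    return [datetime.fromisoformat(v) for v in values]


def read_cast_filter(values, predicate):
    # Step 1: the column is in memory in the file's type (strings).
    column = list(values)
    # Step 2: cast to the type requested by the dataset schema.
    column = cast_to_timestamp(column)
    # Step 3: only now is the filter evaluated, against timestamps.
    return [v for v in column if predicate(v)]


cutoff = datetime(2023, 7, 5)
result = read_cast_filter(raw_column, lambda ts: ts > cutoff)
```

Because the filter runs after the cast, comparing against a timestamp works even though the file stores strings.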
   
   However, if you apply the filter to a dataset, then we are going to try to 
use it for pushdown filtering.  So if we zoom out a little on the above 
pseudocode...
   
   ```
   metadata = get_parquet_metadata()
   for simple_filter_clause in filter: # e.g. things like x > 0
     for row_group in metadata.row_groups:
       row_group_stats = row_group.statistics
       # Casting error is being thrown here
       if simple_filter_clause.cannot_match(row_group_stats):
         skip_row_group()
   
   # column is a string array
   column = read_from_parquet_file(col_index)
   desired_type = dataset_schema.types[col_index]
   if column.type != desired_type:
     # now column is a timestamp array
     column = cast(column, desired_type)
   ...
   table = build_table_from_columns(...)
   ...
   # filter happens down here
   ```
   
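   So we cannot use the filter for pushdown directly: the row group statistics 
are still in the file's type (string), while the filter is written against the 
dataset schema's type (timestamp).  I don't think we can safely cast it.  We 
_could_ just skip this filter (exclude it from pushdown) and then allow it to 
be applied later on.  So I think it is possible to get better behavior here.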
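
The "skip this filter for pushdown" idea could look roughly like the sketch below. This is a toy model in plain Python under stated assumptions, not the Arrow implementation: `row_groups`, `can_skip`, and `scan` are hypothetical names, the statistics are plain min/max strings, and the only filter supported is `value > lower_bound`.

```python
from datetime import datetime

# Hypothetical row groups: min/max statistics are in the file's type (str);
# "values" is what a reader would return for that row group.
row_groups = [
    {"min": "2023-07-01T00:00:00", "max": "2023-07-03T00:00:00",
     "values": ["2023-07-01T00:00:00", "2023-07-03T00:00:00"]},
    {"min": "2023-07-08T00:00:00", "max": "2023-07-12T00:00:00",
     "values": ["2023-07-08T00:00:00", "2023-07-12T00:00:00"]},
]


def can_skip(stats, lower_bound):
    """True only if the statistics prove no row satisfies value > lower_bound."""
    if type(stats["max"]) is not type(lower_bound):
        # Types differ, so the statistics are unusable: do not prune,
        # and leave the filter to be applied after the cast.
        return False
    return stats["max"] <= lower_bound


def scan(groups, lower_bound):
    rows, groups_read = [], 0
    for rg in groups:
        if can_skip(rg, lower_bound):
            continue
        groups_read += 1
        # Cast after reading, as in the pseudocode above.
        column = [datetime.fromisoformat(v) for v in rg["values"]]
        # The filter is applied late, against the casted values.
        rows.extend(v for v in column if v > lower_bound)
    return rows, groups_read


rows, groups_read = scan(row_groups, datetime(2023, 7, 5))
```

In this sketch both row groups are read, because the string statistics cannot be compared with a timestamp bound. The cost is lost pruning, but the result is still correct since the filter runs after the cast instead of raising an error.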


