westonpace commented on issue #36302:
URL: https://github.com/apache/arrow/issues/36302#issuecomment-1632861575
Yes, the casting will happen after we read the column into memory.
Something like (this is just pseudocode)...
```
# column is a string array
column = read_from_parquet_file(col_index)
desired_type = dataset_schema.types[col_index]
if column.type != desired_type:
    # now column is a timestamp array
    column = cast(column, desired_type)
...
table = build_table_from_columns(...)
...
# filter happens down here
```
However, if you apply the filter to a dataset, then we are going to try to
use it for pushdown filtering. So if we zoom out a little on the above
pseudocode...
```
metadata = get_parquet_metadata()
for simple_filter_clause in filter:  # e.g. things like x > 0
    for row_group in metadata.row_groups:
        row_group_stats = row_group.statistics
        # Casting error is being thrown here
        if simple_filter_clause.cannot_match(row_group_stats):
            skip_row_group()
# column is a string array
column = read_from_parquet_file(col_index)
desired_type = dataset_schema.types[col_index]
if column.type != desired_type:
    # now column is a timestamp array
    column = cast(column, desired_type)
...
table = build_table_from_columns(...)
...
# filter happens down here
```
So we cannot use the filter for pushdown directly. I don't think we can
safely cast it. We _could_ just skip this filter (exclude it from pushdown)
and then allow it to be applied later on. So I think it is possible to get
better behavior here.
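
To make the "skip this filter during pushdown" idea concrete, here is a small
plain-Python sketch. It is not Arrow code: `ROW_GROUPS`, `cannot_match` and
`scan` are hypothetical stand-ins, and the only thing it tries to show is that
a type mismatch between the statistics and the filter literal can be treated
as "cannot prove anything, so keep the row group" rather than as an error.

```python
from datetime import datetime

# Hypothetical stand-in for a Parquet file: each row group stores its values
# and its min/max statistics as strings, while the dataset schema (and so the
# filter literal) is a timestamp.
ROW_GROUPS = [
    {
        "stats": {"min": "2023-01-01 00:00:00", "max": "2023-01-31 00:00:00"},
        "values": ["2023-01-01 00:00:00", "2023-01-15 00:00:00", "2023-01-31 00:00:00"],
    },
    {
        "stats": {"min": "2023-02-01 00:00:00", "max": "2023-02-28 00:00:00"},
        "values": ["2023-02-01 00:00:00", "2023-02-28 00:00:00"],
    },
]


def cannot_match(threshold, stats):
    """Return True only if the stats prove that no row has value > threshold."""
    if type(stats["max"]) is not type(threshold):
        # Types differ (string stats vs. timestamp literal): do not cast and
        # do not raise, just decline to prune on this clause.
        return False
    return stats["max"] <= threshold


def scan(row_groups, threshold):
    """Read, cast, then filter; pushdown is only a best-effort optimization."""
    result = []
    for row_group in row_groups:
        if cannot_match(threshold, row_group["stats"]):
            continue  # skip_row_group()
        # column is a list of strings; cast it to the desired timestamp type
        column = [datetime.fromisoformat(v) for v in row_group["values"]]
        # filter happens down here, on the cast values
        result.extend(v for v in column if v > threshold)
    return result


result = scan(ROW_GROUPS, datetime(2023, 1, 20))
```

With string statistics no row group is pruned, so the scan reads more data
than strictly necessary, but the filter is still applied correctly after the
cast.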