jeffreyssmith2nd commented on issue #7925: URL: https://github.com/apache/datafusion/issues/7925#issuecomment-2130106971
The case we're running into in InfluxDB when enabling timezones is slightly different. It is a parquet file with Timestamp without a timezone and then querying with either a timezone or in UTC. The odd part is that we have manually set the schema to be 'UTC', even though the backing parquet file is not. > the issue originates when the schema provided to the ParquetExec and used for planning doesn't match that of the underlying parquet file. I have not been able to reproduce the issue using the datafusion-cli, as there is no way (that I can see) to change the schema. It does behave as I would expect when the schema matches the files. This is the script I'm using to generate some parquet files. ```python import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq df = pd.DataFrame({ 'time': [ pd.Timestamp(year=2024,month=1,day=1,second=0), pd.Timestamp(year=2024,month=1,day=1,second=1), pd.Timestamp(year=2024,month=1,day=1,second=2), pd.Timestamp(year=2024,month=1,day=1,second=3), ], }) table = pa.Table.from_pandas(df) pq.write_table(table, 'example_no_tz.parquet') df = pd.DataFrame({ 'time': [ pd.Timestamp(year=2024,month=1,day=1,second=0,tz='UTC'), pd.Timestamp(year=2024,month=1,day=1,second=1,tz='UTC'), pd.Timestamp(year=2024,month=1,day=1,second=2,tz='UTC'), pd.Timestamp(year=2024,month=1,day=1,second=3,tz='UTC'), ], }) table = pa.Table.from_pandas(df) pq.write_table(table, 'example_with_tz_utc.parquet') ``` Some sample queries: ``` > SET datafusion.execution.parquet.pushdown_filters=true; 0 row(s) fetched. Elapsed 0.003 seconds. > select * from './example_no_tz.parquet'; +---------------------+ | time | +---------------------+ | 2024-01-01T00:00:00 | | 2024-01-01T00:00:01 | | 2024-01-01T00:00:02 | | 2024-01-01T00:00:03 | +---------------------+ 4 row(s) fetched. Elapsed 0.003 seconds. > select * from './example_no_tz.parquet' where time >= '2024-01-01T00:00:02.000'; +---------------------+ | time | +---------------------+ | 2024-01-01T00:00:02 | | 2024-01-01T00:00:03 | +---------------------+ 2 row(s) fetched. Elapsed 0.009 seconds. > select * from './example_no_tz.parquet' where time >= ('2024-01-01T00:00:02.000' at time zone 'Europe/Brussels'); +---------------------+ | time | +---------------------+ | 2024-01-01T00:00:00 | | 2024-01-01T00:00:01 | | 2024-01-01T00:00:02 | | 2024-01-01T00:00:03 | +---------------------+ 4 row(s) fetched. Elapsed 0.006 seconds. ``` If you perform any queries against the example with timezone, where the predicate is not in the same timezone, you get a type coercion error, which I believe makes sense. ``` > select * from './example_with_tz_utc.parquet'; +----------------------+ | time | +----------------------+ | 2024-01-01T00:00:00Z | | 2024-01-01T00:00:01Z | | 2024-01-01T00:00:02Z | | 2024-01-01T00:00:03Z | +----------------------+ 4 row(s) fetched. Elapsed 0.004 seconds. > select * from './example_with_tz_utc.parquet' where time >= '2024-01-01T00:00:02.000'; +----------------------+ | time | +----------------------+ | 2024-01-01T00:00:02Z | | 2024-01-01T00:00:03Z | +----------------------+ 2 row(s) fetched. Elapsed 0.007 seconds. > select * from './example_with_tz_utc.parquet' where time >= ('2024-01-01T00:00:02.000' at time zone 'Europe/Brussels'); Error during planning: Cannot infer common argument type for comparison operation Timestamp(Nanosecond, Some("UTC")) >= Timestamp(Nanosecond, Some("Europe/Brussels")) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org