Hi all,
I've been messing around with Spark and PyArrow Parquet reading. In my testing
I've found that a Parquet file written by Spark that contains a datetime column
yields different datetimes depending on whether it's read back with Spark or with PyArrow.
The attached script demonstrates this.
Output:
Spark Reading the parquet file into a DataFrame:
[Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
PyArrow table has dates as UTC (7 hours ahead)
<pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
[
Timestamp('2015-07-06 06:50:00')
]
Pandas DF from pyarrow table has dates as UTC (7 hours ahead)
Date
0 2015-07-06 06:50:00
1 2015-07-06 06:30:00
I would've expected to end up with the same datetime from both readers, since
no timezone was attached at any point; it's just a date-and-time value.
Am I missing something here, or is this a bug?
Cheers, Lucas Pickup