Hi all,
I've been messing around with Spark and PyArrow Parquet reading. In my testing
I've found that a Parquet file written by Spark that contains a datetime column
yields different datetimes depending on whether it's read back with Spark or with PyArrow.
The attached script demonstrates this.
Output:
Spark Reading the parquet file into a DataFrame:
[Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
PyArrow table has dates as UTC (7 hours ahead)
<pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
[
Timestamp('2015-07-06 06:50:00')
]
Pandas DF from pyarrow table has dates as UTC (7 hours ahead)
Date
0 2015-07-06 06:50:00
1 2015-07-06 06:30:00
I would've expected to end up with the same datetime from both readers, since
no timezone was attached at any point; it's just a date-and-time value.
Am I missing something here, or is this a bug?
Cheers, Lucas Pickup