hi Lucas — Bryan Cutler, Holden Karau, Li Jin, or someone else with deeper knowledge of the Spark timestamp issue (which is known behavior, not a bug per se) should be able to give some extra context about this.
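[For concreteness: the 7-hour shift Lucas reports is consistent with a naive timestamp being read in a UTC-7 session time zone (e.g. US Pacific in July) and then normalized to UTC. A minimal stdlib sketch of that arithmetic, with the -7:00 offset hard-coded as an assumption:]

```python
from datetime import datetime, timedelta

# The naive value Spark reports, assumed to be read in a UTC-7 session zone.
naive = datetime(2015, 7, 5, 23, 50)

# Normalizing that session-local reading to UTC adds the 7-hour offset,
# which reproduces the value PyArrow shows for the same Parquet cell.
as_utc = naive + timedelta(hours=7)
print(as_utc)  # 2015-07-06 06:50:00
```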
My understanding is that when Spark reads timezone-naive timestamp data, it treats the values as session-local, so the values written to Parquet will change based on the runtime locale. You may be able to work around this by casting the Spark timestamps to UTC to force normalization, or by setting the runtime time zone to GMT/UTC. My apologies if I am mistaken about this.

In Arrow, timestamps have two forms:

* Time zone naive (tz=None in Python): there is no notion of UTC or session-local interpretation.
* Time zone aware: the integer values are internally normalized to UTC.

The difficulty is that when you have time zone naive data, Spark may interpret the values differently based on your system locale. This is a pretty serious rough edge in my opinion; at minimum we should add a guide on using Spark and pyarrow together to the pyarrow documentation, so that these "gotchas" are explained in a single place.

- Wes

On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup <lucas.tot0.pic...@gmail.com> wrote:
> Hi all,
>
> Very sorry if people already responded to this at lucas.pic...@microsoft.com.
> An INVALID identifier was attached to the end of the reply address for some
> reason, which may have caused replies to be lost.
>
> I've been experimenting with Spark and PyArrow Parquet reading. In my
> testing I've found that a Parquet file written by Spark containing a
> datetime column yields different datetimes when read by Spark and by
> PyArrow.
>
> The attached script demonstrates this.
> Output:
>
> Spark reading the Parquet file into a DataFrame:
> [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
>  Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
>
> PyArrow table has dates as UTC (7 hours ahead):
> <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
> [
>   Timestamp('2015-07-06 06:50:00')
> ]
>
> Pandas DataFrame from the pyarrow table has dates as UTC (7 hours ahead):
>                  Date
> 0 2015-07-06 06:50:00
> 1 2015-07-06 06:30:00
>
> I would've expected to end up with the same datetime from both readers,
> since there was no timezone attached at any point; it is just a date and
> time value. Am I missing anything here? Or is this a bug?
>
> I attempted to intercept the timestamp values before pyarrow turns them
> into Python objects, so I could add timezone information, which may fix
> this issue. The goal is to qualify the TimestampValue with a timezone (by
> creating a new column in the Arrow table based on the previous one). If
> this can be done before the values are converted to Python, it may fix the
> issue I was having. But it doesn't appear that I can create a new
> timestamp-typed column with the values from the old timestamp column.
>
> Here is the code I'm using:
>
> def chunkedToArray(data):
>     for chunk in data.iterchunks():
>         for value in chunk:
>             yield value
>
> def datetimeColumnsAddTimezone(table):
>     for i, field in enumerate(table.schema):
>         if field.type == pa.timestamp('ns'):
>             newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'),
>                                 field.nullable, field.metadata)
>             newArray = pa.array([val for val in
>                                  chunkedToArray(table[i].data)],
>                                 pa.timestamp('ns', tz='GMT'))
>             newColumn = pa.Column.from_array(newField, newArray)
>             table = table.remove_column(i)
>             table = table.add_column(i, newColumn)
>     return table
>
> Cheers, Lucas Pickup
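[A note on the approach in Lucas's quoted code: what he is after — attaching a time zone to a value without shifting its stored wall-clock reading — is a reinterpretation, not a conversion. At the scalar level the stdlib expresses this with `datetime.replace(tzinfo=...)`; the sketch below uses that, with the values borrowed from the thread's output. Later pyarrow releases added column-level APIs for this kind of reinterpretation, but the code in this 2017 thread predates them.]

```python
from datetime import datetime, timedelta, timezone

# A naive value as PyArrow read it from the Spark-written Parquet file.
naive = datetime(2015, 7, 6, 6, 50)

# replace(tzinfo=...) attaches a zone while keeping the wall-clock fields:
# it reinterprets the value as UTC rather than converting it.
aware = naive.replace(tzinfo=timezone.utc)

print(aware.hour == naive.hour)           # True: wall clock unchanged
print(aware.utcoffset() == timedelta(0))  # True: now explicitly UTC
```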