see https://issues.apache.org/jira/browse/ARROW-1425
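
For anyone landing here from the JIRA: below is a minimal sketch of the
"force the runtime to UTC" workaround Wes suggests in the quoted thread.
It assumes Spark 2.2+, where the spark.sql.session.timeZone conf exists;
the app name and file path are placeholders, not from the original thread:

    from pyspark.sql import SparkSession

    # Pin the session timezone to UTC so timezone-naive timestamps are
    # interpreted (and written to Parquet) as UTC instants rather than
    # session-local ones.
    spark = (SparkSession.builder
             .appName("utc-parquet-example")               # placeholder
             .config("spark.sql.session.timeZone", "UTC")  # Spark 2.2+
             .getOrCreate())

    df = spark.read.parquet("dates.parquet")  # placeholder path
    df.show(truncate=False)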
On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Lucas,
>
> Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge
> of the Spark timestamp issue (which is known, and not a bug per se)
> should be able to give some extra context about this.
>
> My understanding is that when you read timezone-naive data in Spark,
> it is treated as session-local by the Spark runtime, and so the values
> that are written to Parquet will change based on the runtime locale. I
> think you can resolve this by casting the Spark timestamps to UTC to
> force normalization, or by setting the runtime locale to GMT/UTC. My
> apologies if I am mistaken about this.
>
> In Arrow, timestamps have two forms:
>
> * Time zone naive (where tz=None in Python); there is no notion of UTC
>   or session-localness.
> * Time zone aware, where the integer values are internally normalized
>   to UTC.
>
> The difficulty is that when you have time zone naive data, Spark may
> interpret the values differently based on your system locale. This is
> a pretty serious rough edge in my opinion; we should at minimum add a
> guide to using Spark and pyarrow together in the pyarrow documentation
> so that these "gotchas" can be well explained in a single place.
>
> - Wes
>
> On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup
> <lucas.tot0.pic...@gmail.com> wrote:
>> Hi all,
>>
>> Very sorry if people already responded to this at:
>> lucas.pic...@microsoft.com. There was an INVALID identifier attached
>> to the end of the reply address for some reason, which may have caused
>> replies to be lost.
>>
>> I've been experimenting with Spark and PyArrow Parquet reading. In my
>> testing I've found that a Parquet file written by Spark containing a
>> datetime column results in different datetimes from Spark and PyArrow.
>>
>> The attached script demonstrates this.
>>
>> Output:
>>
>> Spark reading the parquet file into a DataFrame:
>>
>>     [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
>>      Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
>>
>> PyArrow table has dates as UTC (7 hours ahead):
>>
>>     <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
>>     [
>>       Timestamp('2015-07-06 06:50:00')
>>     ]
>>
>> Pandas DataFrame from the pyarrow table has dates as UTC (7 hours
>> ahead):
>>
>>                      Date
>>     0 2015-07-06 06:50:00
>>     1 2015-07-06 06:30:00
>>
>> I would've expected to end up with the same datetime from both
>> readers, since there was no timezone attached at any point. It's just
>> a date and time value. Am I missing anything here? Or is this a bug?
>>
>> I attempted to intercept the timestamp values before pyarrow turns
>> them into Python objects, so I could add timezone information, which
>> may fix this issue.
>>
>> The goal is to qualify the TimestampValue with a timezone (by creating
>> a new column in the arrow table based off the previous one). If this
>> can be done before the values are converted to Python, it may fix the
>> issue I was having. But it doesn't appear that I can create a new
>> Timestamp type column with the values from the old timestamp column.
>>
>> Here is the code I'm using:
>>
>> import pyarrow as pa
>>
>> def chunkedToArray(data):
>>     # Flatten a ChunkedArray into a single stream of values.
>>     for chunk in data.iterchunks():
>>         for value in chunk:
>>             yield value
>>
>> def datetimeColumnsAddTimezone(table):
>>     # Replace each tz-naive ns timestamp column with a GMT-tagged copy.
>>     for i, field in enumerate(table.schema):
>>         if field.type == pa.timestamp('ns'):
>>             newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'),
>>                                 field.nullable, field.metadata)
>>             newArray = pa.array(
>>                 [val for val in chunkedToArray(table[i].data)],
>>                 pa.timestamp('ns', tz='GMT'))
>>             newColumn = pa.Column.from_array(newField, newArray)
>>             table = table.remove_column(i)
>>             table = table.add_column(i, newColumn)
>>     return table
>>
>> Cheers,
>> Lucas Pickup
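
A small illustration of the two timestamp forms Wes describes above
(tz-naive vs tz-aware), written against a current pyarrow rather than the
2017-era API used in the thread; the sample value is made up:

    import datetime
    import pyarrow as pa

    naive = pa.array([datetime.datetime(2015, 7, 5, 23, 50)],
                     type=pa.timestamp('ns'))   # tz=None: no notion of UTC
    aware = naive.cast(pa.timestamp('ns', tz='UTC'))  # values now UTC instants

    print(naive.type)  # timestamp[ns]
    print(aware.type)  # timestamp[ns, tz=UTC]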
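And a sketch of the column replacement Lucas was attempting, again
assuming a modern pyarrow (>= 1.0): pa.Column and Column.from_array no
longer exist, but Table.set_column plus ChunkedArray.cast achieve the
same thing without round-tripping every value through a Python object:

    import pyarrow as pa

    def datetime_columns_add_timezone(table: pa.Table) -> pa.Table:
        # Retag every tz-naive nanosecond timestamp column as UTC.
        for i, field in enumerate(table.schema):
            if field.type == pa.timestamp('ns'):
                utc_type = pa.timestamp('ns', tz='UTC')
                # cast() retags the column; the stored int64 values
                # are unchanged.
                casted = table.column(i).cast(utc_type)
                table = table.set_column(
                    i,
                    pa.field(field.name, utc_type,
                             field.nullable, field.metadata),
                    casted)
        return table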