see https://issues.apache.org/jira/browse/ARROW-1425
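
For anyone landing here from the JIRA: below is a minimal sketch of the
"force the runtime to UTC" workaround Wes suggests in the quoted thread.
It assumes Spark 2.2+, where the spark.sql.session.timeZone conf exists;
the app name and file path are placeholders, not from the original thread:

    from pyspark.sql import SparkSession

    # Pin the session timezone to UTC so timezone-naive timestamps are
    # interpreted (and written to Parquet) as UTC instants rather than
    # session-local ones.
    spark = (SparkSession.builder
             .appName("utc-parquet-example")               # placeholder
             .config("spark.sql.session.timeZone", "UTC")  # Spark 2.2+
             .getOrCreate())

    df = spark.read.parquet("dates.parquet")  # placeholder path
    df.show(truncate=False)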
On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Lucas,
>
> Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge
> of the Spark timestamp issue (which is known, and not a bug per se)
> should be able to give some extra context about this.
>
> My understanding is that when you read timezone-naive data in Spark,
> it is treated as session-local by the Spark runtime, and so the values
> that are written to Parquet will change based on the runtime locale. I
> think you can resolve this by casting the Spark timestamps to UTC to
> force normalization, or by setting the runtime locale to GMT/UTC. My
> apologies if I am mistaken about this.
>
> In Arrow, timestamps have two forms:
>
> * Time zone naive (where tz=None in Python); there is no notion of UTC
>   or session-localness.
> * Time zone aware, where the integer values are internally normalized
>   to UTC.
>
> The difficulty is that when you have time zone naive data, Spark may
> interpret the values differently based on your system locale. This is
> a pretty serious rough edge in my opinion; we should at minimum add a
> guide to using Spark and pyarrow together in the pyarrow documentation
> so that these "gotchas" can be well explained in a single place.
>
> - Wes
>
> On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup
> <lucas.tot0.pic...@gmail.com> wrote:
>> Hi all,
>>
>> Very sorry if people already responded to this at:
>> lucas.pic...@microsoft.com. There was an INVALID identifier attached
>> to the end of the reply address for some reason, which may have caused
>> replies to be lost.
>>
>> I've been experimenting with Spark and PyArrow Parquet reading. In my
>> testing I've found that a Parquet file written by Spark containing a
>> datetime column results in different datetimes from Spark and PyArrow.
>>
>> The attached script demonstrates this.
>>
>> Output:
>>
>> Spark reading the parquet file into a DataFrame:
>>
>>     [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
>>      Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
>>
>> PyArrow table has dates as UTC (7 hours ahead):
>>
>>     <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
>>     [
>>       Timestamp('2015-07-06 06:50:00')
>>     ]
>>
>> Pandas DataFrame from the pyarrow table has dates as UTC (7 hours
>> ahead):
>>
>>                      Date
>>     0 2015-07-06 06:50:00
>>     1 2015-07-06 06:30:00
>>
>> I would've expected to end up with the same datetime from both
>> readers, since there was no timezone attached at any point. It's just
>> a date and time value. Am I missing anything here? Or is this a bug?
>>
>> I attempted to intercept the timestamp values before pyarrow turns
>> them into Python objects, so I could add timezone information, which
>> may fix this issue.
>>
>> The goal is to qualify the TimestampValue with a timezone (by creating
>> a new column in the arrow table based off the previous one). If this
>> can be done before the values are converted to Python, it may fix the
>> issue I was having. But it doesn't appear that I can create a new
>> Timestamp type column with the values from the old timestamp column.
>>
>> Here is the code I'm using:
>>
>> import pyarrow as pa
>>
>> def chunkedToArray(data):
>>     # Flatten a ChunkedArray into a single stream of values.
>>     for chunk in data.iterchunks():
>>         for value in chunk:
>>             yield value
>>
>> def datetimeColumnsAddTimezone(table):
>>     # Replace each tz-naive ns timestamp column with a GMT-tagged copy.
>>     for i, field in enumerate(table.schema):
>>         if field.type == pa.timestamp('ns'):
>>             newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'),
>>                                 field.nullable, field.metadata)
>>             newArray = pa.array(
>>                 [val for val in chunkedToArray(table[i].data)],
>>                 pa.timestamp('ns', tz='GMT'))
>>             newColumn = pa.Column.from_array(newField, newArray)
>>             table = table.remove_column(i)
>>             table = table.add_column(i, newColumn)
>>     return table
>>
>> Cheers,
>> Lucas Pickup
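
A small illustration of the two timestamp forms Wes describes above
(tz-naive vs tz-aware), written against a current pyarrow rather than the
2017-era API used in the thread; the sample value is made up:

    import datetime
    import pyarrow as pa

    naive = pa.array([datetime.datetime(2015, 7, 5, 23, 50)],
                     type=pa.timestamp('ns'))   # tz=None: no notion of UTC
    aware = naive.cast(pa.timestamp('ns', tz='UTC'))  # values now UTC instants

    print(naive.type)  # timestamp[ns]
    print(aware.type)  # timestamp[ns, tz=UTC]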
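And a sketch of the column replacement Lucas was attempting, again
assuming a modern pyarrow (>= 1.0): pa.Column and Column.from_array no
longer exist, but Table.set_column plus ChunkedArray.cast achieve the
same thing without round-tripping every value through a Python object:

    import pyarrow as pa

    def datetime_columns_add_timezone(table: pa.Table) -> pa.Table:
        # Retag every tz-naive nanosecond timestamp column as UTC.
        for i, field in enumerate(table.schema):
            if field.type == pa.timestamp('ns'):
                utc_type = pa.timestamp('ns', tz='UTC')
                # cast() retags the column; the stored int64 values
                # are unchanged.
                casted = table.column(i).cast(utc_type)
                table = table.set_column(
                    i,
                    pa.field(field.name, utc_type,
                             field.nullable, field.metadata),
                    casted)
        return table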