The option

pq.write_table(..., flavor='spark')

made it into the 0.7.0 release
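
For example, a minimal sketch of the new call (file name and data are
illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'ts': pd.date_range('2017-01-01', periods=3)})
table = pa.Table.from_pandas(df)

# flavor='spark' applies Spark-compatible settings, including INT96
# timestamps and sanitized field names
pq.write_table(table, 'example.parquet', flavor='spark')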

- Wes

On Fri, Sep 8, 2017 at 6:28 PM, Julien Le Dem <julien.le...@gmail.com> wrote:
> The INT96 deprecation is slowly bubbling up the stack. There are still
> discussions in Spark on how to make the change, so for now, even though
> it's deprecated, it is still used in some places. This should get resolved
> in the near future.
>
> Julien
>
>> On Sep 8, 2017, at 14:12, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Turning on int96 timestamps is the solution right now. To save
>> yourself some typing, you could declare
>>
>> parquet_options = {
>>    'compression': ...,
>>    'use_deprecated_int96_timestamps': True
>> }
>>
>> pq.write_table(..., **parquet_options)
>>
>>> On Fri, Sep 8, 2017 at 5:08 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
>>> So this is certainly good for future versions of Arrow. Do you have any
>>> specific recommendations for a workaround in the meantime?
>>>
>>> Saving a Parquet file with datetimes will obviously be a common use case,
>>> and if I'm understanding it correctly, a Parquet file saved with PyArrow
>>> right now will not be readable by Spark. Yes? (I'm asking this as opposed
>>> to stating it.)
>>>
>>> -Brian
>>>
>>>> On Fri, Sep 8, 2017 at 2:58 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> Indeed, INT96 is deprecated in the Parquet format. There are other
>>>> issues with Spark (it places restrictions on table field names, for
>>>> example), so it may be worth adding an option like
>>>>
>>>> pq.write_table(table, where, flavor='spark')
>>>>
>>>> or maybe better
>>>>
>>>> pq.write_table(table, where, flavor='spark-2.2')
>>>>
>>>> and this would set the correct options for that version of Spark.
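>>>>
>>>> In the meantime, a rough sketch of the equivalent by hand (the helper
>>>> name is hypothetical; it just pins the option Spark 2.2 needs):
>>>>
>>>> def write_table_for_spark(table, where, **kwargs):
>>>>     # Spark 2.2 rejects INT64 TIMESTAMP_MICROS, so write timestamps
>>>>     # as INT96 instead
>>>>     pq.write_table(table, where,
>>>>                    use_deprecated_int96_timestamps=True,
>>>>                    **kwargs)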
>>>>
>>>> I created https://issues.apache.org/jira/browse/ARROW-1499 as a place
>>>> to discuss this further.
>>>>
>>>> - Wes
>>>>
>>>>
>>>> On Fri, Sep 8, 2017 at 4:28 PM, Brian Wylie <briford.wy...@gmail.com>
>>>> wrote:
>>>>> Okay,
>>>>>
>>>>> So after some additional debugging, I can get around this if I set
>>>>>
>>>>> use_deprecated_int96_timestamps=True
>>>>>
>>>>> on the pq.write_table(arrow_table, filename, compression=compression,
>>>>> use_deprecated_int96_timestamps=True) call.
>>>>>
>>>>> But that just feels SO wrong... as I'm sure it's deprecated for a reason
>>>>> (i.e., this will bite me later, and badly).
>>>>>
>>>>>
>>>>> I also see this issue (or at least a related issue) referenced in this
>>>>> Jeff Knupp blog post:
>>>>>
>>>>> https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record
>>>>>
>>>>> So shrug... any suggestions are greatly appreciated :)
>>>>>
>>>>> -Brian
>>>>>
>>>>> On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Apologies if this isn't quite the right place to ask this question, but
>>>>>> I figured Wes/others might know right off the bat :)
>>>>>>
>>>>>>
>>>>>> Context:
>>>>>> - Mac OS X laptop
>>>>>> - PySpark: 2.2.0
>>>>>> - PyArrow: 0.6.0
>>>>>> - Pandas: 0.19.2
>>>>>>
>>>>>> Issue Explanation:
>>>>>> - I'm converting my Pandas DataFrame to a Parquet file with code very
>>>>>> similar to
>>>>>>       - http://wesmckinney.com/blog/python-parquet-update/
>>>>>> - My Pandas DataFrame has a datetime index:  http_df.index.dtype =
>>>>>> dtype('<M8[ns]')
>>>>>> - When loading the saved parquet file I get the error below
>>>>>> - If I remove that index everything works fine
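>>>>>>
>>>>>> A minimal sketch of the failing round trip (names and values are
>>>>>> illustrative):
>>>>>>
>>>>>> import pandas as pd
>>>>>> import pyarrow as pa
>>>>>> import pyarrow.parquet as pq
>>>>>>
>>>>>> http_df = pd.DataFrame({'bytes': [100, 200]},
>>>>>>                        index=pd.date_range('2017-09-08', periods=2))
>>>>>> pq.write_table(pa.Table.from_pandas(http_df), 'http.parquet')
>>>>>>
>>>>>> # In PySpark, spark.read.parquet('http.parquet') then raises the
>>>>>> # AnalysisException shown below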
>>>>>>
>>>>>> ERROR:
>>>>>> - Py4JJavaError: An error occurred while calling o34.parquet.
>>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>>>>> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
>>>>>> in stage 0.0 (TID 0, localhost, executor driver):
>>>>>> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
>>>>>> (TIMESTAMP_MICROS);
>>>>>>
>>>>>> Full Code to reproduce:
>>>>>> - https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
>>>>>>
>>>>>>
>>>>>> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
>>>>>>
>>>>>> -Brian
>>>>>>
