So, this is certainly good for future versions of Arrow. Do you have any specific recommendations for a workaround currently?
Saving a Parquet file with datetimes will obviously be a common use case, and if I'm understanding correctly, a Parquet file saved with PyArrow right now will not be readable by Spark. Yes? (I'm asking, not stating.)

-Brian

On Fri, Sep 8, 2017 at 2:58 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Indeed, INT96 is deprecated in the Parquet format. There are other
> issues with Spark (it places restrictions on table field names, for
> example), so it may be worth adding an option like
>
>     pq.write_table(table, where, flavor='spark')
>
> or maybe better
>
>     pq.write_table(table, where, flavor='spark-2.2')
>
> and this would set the correct options for that version of Spark.
>
> I created https://issues.apache.org/jira/browse/ARROW-1499 as a place
> to discuss further
>
> - Wes
>
> On Fri, Sep 8, 2017 at 4:28 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> > Okay,
> >
> > So after some additional debugging, I can get around this if I set
> >
> >     use_deprecated_int96_timestamps=True
> >
> > on the pq.write_table(arrow_table, filename, compression=compression,
> > use_deprecated_int96_timestamps=True) call.
> >
> > But that just feels SO wrong... as I'm sure it's deprecated for a reason
> > (i.e. this will bite me later and badly).
> >
> > I also see this issue (or at least a related issue) referenced in this Jeff
> > Knupp blog:
> >
> > https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record
> >
> > So, shrug...
> > Any suggestions are greatly appreciated :)
> >
> > -Brian
> >
> > On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> >
> >> Apologies if this isn't quite the right place to ask this question, but I
> >> figured Wes/others might know right off the bat :)
> >>
> >> Context:
> >> - Mac OS X laptop
> >> - PySpark: 2.2.0
> >> - PyArrow: 0.6.0
> >> - Pandas: 0.19.2
> >>
> >> Issue explanation:
> >> - I'm converting my Pandas DataFrame to a Parquet file with code very
> >>   similar to http://wesmckinney.com/blog/python-parquet-update/
> >> - My Pandas DataFrame has a datetime index: http_df.index.dtype =
> >>   dtype('<M8[ns]')
> >> - When loading the saved Parquet file I get the error below
> >> - If I remove that index, everything works fine
> >>
> >> Error:
> >>
> >>     Py4JJavaError: An error occurred while calling o34.parquet.
> >>     : org.apache.spark.SparkException: Job aborted due to stage failure:
> >>     Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
> >>     in stage 0.0 (TID 0, localhost, executor driver):
> >>     org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
> >>     (TIMESTAMP_MICROS);
> >>
> >> Full code to reproduce:
> >> https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
> >>
> >> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
> >>
> >> -Brian