I am encountering this issue without ever explicitly creating a pandas `DataFrame` myself. It seems that when calling `pyarrow.parquet.write_to_dataset` with `partition_cols` specified, pyarrow (for whatever reason) converts the table to a `pandas.DataFrame`, groups it by the partition columns, and then converts each group back to a `pyarrow.Table` before writing (see [source code](https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L1137)). Why it does this, I have no idea.
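For reference, here is a condensed sketch of what that code path does. This is a paraphrase of the linked `parquet.py`, with names and details simplified; the real implementation also handles filesystems, metadata collection, and so on:

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq


def write_to_dataset_sketch(table, root_path, partition_cols):
    # The whole table is first converted to pandas ...
    df = table.to_pandas()
    # ... then grouped by the partition columns ...
    for keys, subgroup in df.groupby(partition_cols):
        if not isinstance(keys, tuple):
            keys = (keys,)
        subdir = "/".join(
            "{}={}".format(col, key)
            for col, key in zip(partition_cols, keys)
        )
        os.makedirs(os.path.join(root_path, subdir), exist_ok=True)
        # ... and each group is converted *back* to a pyarrow.Table
        # before being written out as a Parquet file.
        subtable = pa.Table.from_pandas(
            subgroup.drop(columns=partition_cols), preserve_index=False
        )
        pq.write_table(
            subtable, os.path.join(root_path, subdir, "part-0.parquet")
        )
```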
One consequence of this design decision, however, is that writing a table with a column of type `date32[day]` that is not among the partition columns will often fail. I presume that it fails because of precision issues in the round trip between pandas and pyarrow (i.e., from `date32[day]` -> `datetime64[ns]` when each partition of the table is created in pandas, then from `datetime64[ns]` -> `date32[day]` when going back to pyarrow).
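A minimal sketch of that round trip in isolation (the column names are made up for illustration; `date_as_object=False` is passed explicitly to force the `datetime64[ns]` conversion described above, which newer pyarrow versions avoid by default):

```python
import datetime

import pyarrow as pa

# A table with a date32[day] column, built directly in pyarrow.
table = pa.table({
    "d": pa.array([datetime.date(2018, 1, 1)], type=pa.date32()),
    "part": ["a"],
})
print(table.schema)  # d: date32[day]

# Simulate the per-partition round trip that write_to_dataset performs.
df = table.to_pandas(date_as_object=False)  # date32[day] -> datetime64[ns]
round_tripped = pa.Table.from_pandas(df, preserve_index=False)
print(round_tripped.schema)  # d comes back as timestamp[ns], not date32[day]
```

The mismatch between the original `date32[day]` schema and the round-tripped `timestamp[ns]` schema is consistent with the write failures described above.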
