I am encountering this issue without ever explicitly creating a pandas 
`DataFrame` myself. When `pyarrow.parquet.write_to_dataset` is given a 
`partition_cols` argument, pyarrow converts the table to a 
`pandas.DataFrame`, groups it by the partition columns, and converts each 
group back to a `pyarrow.Table` (see the [source 
code](https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L1137)).
Why it does this, I have no idea.
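
As a rough sketch of what that code path appears to do (the function name is 
mine and the details differ from the real implementation):

```python
import pyarrow as pa

def partition_round_trip(table, partition_cols):
    """Sketch of the pandas round trip that write_to_dataset seems
    to perform when partition_cols is given (paraphrased from the
    linked source; not the actual implementation)."""
    df = table.to_pandas()  # the whole Table goes through pandas first
    for keys, subgroup in df.groupby(partition_cols):
        # The partition columns are encoded in the directory names,
        # so they are dropped before converting back to Arrow.
        subgroup = subgroup.drop(columns=partition_cols)
        # Types pandas cannot represent natively (such as date32[day])
        # may not survive this conversion back unchanged.
        yield keys, pa.Table.from_pandas(subgroup, preserve_index=False)
```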

One consequence of this design decision, however, is that writing tables 
with `date32[day]` columns that are not among the partition columns will 
often fail. I presume it fails because the type does not survive the round 
trip between pandas and pyarrow (i.e., from `date32[day]` -> 
`datetime64[ns]` when creating each partition of the table in pandas, then 
back from `datetime64[ns]` when returning to pyarrow, which does not 
necessarily restore `date32[day]`).
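
A minimal reproduction along these lines (the column names and output path 
are hypothetical, and the exact error depends on the pyarrow version):

```python
from datetime import date

import pyarrow as pa
import pyarrow.parquet as pq

# A table with a date32 column plus a string column to partition on.
table = pa.Table.from_arrays(
    [
        pa.array([date(2018, 1, 1), date(2018, 1, 2)], type=pa.date32()),
        pa.array(["a", "b"]),
    ],
    names=["day", "part"],
)

# Writing without partition_cols works; with it, each partition is
# round-tripped through pandas and the date32 column can trip it up.
pq.write_to_dataset(table, "/tmp/dates_dataset", partition_cols=["part"])
```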
