wjones127 commented on issue #12458:
URL: https://github.com/apache/arrow/issues/12458#issuecomment-1043714499
Hi @wukan1986! Thank you for the bug report and the very clear reproducible
example.
This seems to happen because column C loses its type information when the
pandas DataFrame is converted to an Arrow table (which happens inside
`to_parquet`). Converting an empty slice of the frame shows it:
```python
pa.Table.from_pandas(df.head(0))
```
```
pyarrow.Table
A: timestamp[ns]
B: double
C: null
D: int64
__index_level_0__: timestamp[ns]
----
A: [[]]
B: [[]]
C: [0 nulls]
D: [[]]
__index_level_0__: [[]]
```
As a result, the two Parquet files end up with different schemas:
```python
import pyarrow.parquet as pq
print(pq.read_table(temp_dir + '/0.parquet').schema)
print(pq.read_table(temp_dir + '/1.parquet').schema)
```
```
A: timestamp[us]
B: double
C: null
D: int64
__index_level_0__: timestamp[us]
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' +
753
A: timestamp[us]
B: double
C: string
D: int64
__index_level_0__: timestamp[us]
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' +
755
```
I think this is a limitation of pandas string columns, which only have
`dtype: object`: if the column is empty, there are no values from which Arrow
can infer a type, so it falls back to `null`. The best workaround is to give
the column a more specific type. The pandas string type, the PyArrow-backed
string type, or the Categorical type would all work well here.
```python
import pandas as pd
import pyarrow as pa
from tempfile import mkdtemp
print(pd.__version__) # 1.4.1
print(pa.__version__) # 7.0.0
dr = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
df = pd.DataFrame(index=dr)
df['A'] = pd.to_datetime('today')
df['B'] = 1.0
df['C'] = 'a'
df['D'] = 2
# Any of these will work
# df['C'] = pd.Series(df['C'], dtype="string")
# df['C'] = pd.Series(df['C'], dtype="string[pyarrow]")
df['C'] = pd.Categorical(df['C'])
```