wjones127 commented on issue #12458:
URL: https://github.com/apache/arrow/issues/12458#issuecomment-1043714499


   Hi @wukan1986! Thank you for the bug report and the very clear reproducible 
example. 
   
   This seems to happen because column C loses its type information when the 
pandas DataFrame is converted to an Arrow table (which happens inside 
`to_parquet`).
   
   ```python
   pa.Table.from_pandas(df.head(0))
   ```
   
   ```
   pyarrow.Table
   A: timestamp[ns]
   B: double
   C: null
   D: int64
   __index_level_0__: timestamp[ns]
   ----
   A: [[]]
   B: [[]]
   C: [0 nulls]
   D: [[]]
   __index_level_0__: [[]]
   ```
   
   As a result, the two Parquet files have different schemas:
   
   ```python
   import pyarrow.parquet as pq
   
   print(pq.read_table(temp_dir + '/0.parquet').schema)
   print(pq.read_table(temp_dir + '/1.parquet').schema)
   ```
   
   ```
   A: timestamp[us]
   B: double
   C: null
   D: int64
   __index_level_0__: timestamp[us]
   -- schema metadata --
   pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 753
   A: timestamp[us]
   B: double
   C: string
   D: int64
   __index_level_0__: timestamp[us]
   -- schema metadata --
   pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 755
   ```
   
   I think this is a limitation of pandas string columns, which just have 
`dtype: object`: if the column is empty, there are no values from which Arrow 
can infer a type, so it falls back to `null`. The best way to work around this 
is to give the column a more specific type. The pandas string type, the 
PyArrow-backed string type, or the Categorical type would each work well here.
   
   
   ```python
   import pandas as pd
   import pyarrow as pa
   from tempfile import mkdtemp
   
   print(pd.__version__)  # 1.4.1
   print(pa.__version__)  # 7.0.0
   
   dr = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
   df = pd.DataFrame(index=dr)
   df['A'] = pd.to_datetime('today')
   df['B'] = 1.0
   df['C'] = 'a'
   df['D'] = 2
   
   # Any one of these gives column C an explicit type
   # df['C'] = pd.Series(df['C'], dtype="string")
   # df['C'] = pd.Series(df['C'], dtype="string[pyarrow]")
   df['C'] = pd.Categorical(df['C'])
   ```

