jorisvandenbossche commented on issue #47460:
URL: https://github.com/apache/arrow/issues/47460#issuecomment-3257621992

   Sticking to the string topic here for a moment, you say:
   
   >  it appears this is functionnaly a string and it's read as an object 
because all values are null. If a single value is not null, the column becomes 
a string.
   
   First, reading the Parquet file with pyarrow itself (into an Arrow table, 
without converting to pandas) correctly reads both the all-null and the 
partially-null string columns as strings:
   
   ```python
   >>> import pyarrow.parquet as pq
   >>> table = pq.read_table("Downloads/test_parquet_null.parquet")
   >>> table
   pyarrow.Table
   test_int64: int64
   test_name: string
   test_ts: timestamp[ns]
   string_partially_null: string
   ----
   test_int64: [[1,2,null]]
   test_name: [[null,null,null]]
   test_ts: [[null,null,null]]
   string_partially_null: [[null,"toto","toto"]]
   ```
   
   So any confusion can only come from the Arrow->pandas conversion. But with 
currently released pandas, both string columns are converted to object dtype 
(since that is how pandas stores strings by default right now):
   
   ```python
   >>> table.to_pandas().dtypes
   test_int64                      float64
   test_name                        object
   test_ts                  datetime64[ns]
   string_partially_null            object
   dtype: object
   ```
   
   The upcoming pandas 3.0 release will have a proper string dtype, and in 
that case both columns will be converted to string columns:
   
   ```python
   # using pandas 3.0 dev
   >>> table.to_pandas().dtypes
   test_int64                      float64
   test_name                           str
   test_ts                  datetime64[ns]
   string_partially_null               str
   dtype: object
   ```
   
   In summary, I don't really understand what you mean by "it's read as an 
object because all values are null": I don't see any difference between the 
all-null and the partially-null column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
