jorisvandenbossche commented on issue #47460: URL: https://github.com/apache/arrow/issues/47460#issuecomment-3257621992
Sticking to the string topic here for a moment, you say:

> it appears this is functionnaly a string and it's read as an object because all values are null. If a single value is not null, the column becomes a string.

First, reading the Parquet file with pyarrow itself (into an Arrow table, without conversion to pandas) correctly reads both the all-null and the partially-null string columns as strings:

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("Downloads/test_parquet_null.parquet")
>>> table
pyarrow.Table
test_int64: int64
test_name: string
test_ts: timestamp[ns]
string_partially_null: string
----
test_int64: [[1,2,null]]
test_name: [[null,null,null]]
test_ts: [[null,null,null]]
string_partially_null: [[null,"toto","toto"]]
```

So the confusion only concerns the Arrow -> pandas conversion. But with currently released pandas, both string columns get converted to object dtype (as that is the default way pandas stores strings right now):

```python
>>> table.to_pandas().dtypes
test_int64                      float64
test_name                        object
test_ts                  datetime64[ns]
string_partially_null            object
dtype: object
```

The upcoming pandas 3.0 release will have a proper string dtype, and in that case both columns will be string columns:

```python
# using pandas 3.0 dev
>>> table.to_pandas().dtypes
test_int64                      float64
test_name                           str
test_ts                  datetime64[ns]
string_partially_null               str
dtype: object
```

In summary, I don't really understand what you mean by "it's read as an object because all values are null": I don't see any difference between the all-null and the partially-null column.