jorisvandenbossche commented on issue #11967: URL: https://github.com/apache/arrow/issues/11967#issuecomment-995708589
> Where does pyarrow get `INT32` as "physical_type" when the column is completely empty (only null values)?

For Parquet, you need to distinguish the "physical_type" and the "logical_type" (as shown in the output of the `ParquetColumnSchema`, this is "INT32" vs "Null" for this column of all nulls).

Parquet only has a limited set of physical types, see https://github.com/apache/parquet-format#types. "Null" is not one of them; it is only a logical type, and a logical type always "annotates" some actual physical type.

So when Arrow saves a "null" column (in Arrow this is an actual, proper type) to Parquet, it can use the "Null" logical type, but it still needs to choose some physical type for the column in the Parquet file. By default, Arrow uses INT32 for this. That explains where the "INT32" physical type is coming from.

But in general, I think you don't need to care about this and can ignore it. When reading the Parquet file into an Arrow table, we will correctly notice the "Null" logical type and create a "null" column in the resulting Arrow table (basically ignoring the INT32 physical type of the Parquet field).
