jorisvandenbossche commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995708589


   > Where does pyarrow get `INT32` as "physical_type" when the column is 
completely empty (only null values)
   
   For Parquet, you need to distinguish between the "physical_type" and the 
"logical_type" (as shown in the output of the `ParquetColumnSchema`, these are 
"INT32" and "Null" for this column of all nulls). 
   
   Parquet only has a limited set of physical types, see 
https://github.com/apache/parquet-format#types. "Null" is not one of them; it 
is only a logical type, and a logical type always "annotates" some actual 
physical type. 
   
   So when Arrow saves a "null" column (in Arrow this is an actual, proper 
type) to Parquet, it can use a "Null" logical type, but it still needs to 
choose some physical type for the column in the Parquet file. And by default, 
Arrow uses INT32 for the physical type. 
   
   That explains where the "INT32" physical type comes from. In general, 
though, I think you can ignore this. When reading the Parquet file into an 
Arrow table, we will correctly notice the "Null" logical type and create a 
"null" column in the resulting Arrow table (basically ignoring the INT32 
physical type of the Parquet field).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
