[GitHub] [arrow] mishbahr commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns

GitBox Thu, 16 Dec 2021 03:18:48 -0800


mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995697816



   Please ignore my initial code sample, which used a real Parquet file as the source.
   Here is a fresh example that starts from `df = pd.DataFrame(data={"col1": [None, ], "col2": ["foo1", ]})`.
   
   Where does pyarrow get `INT32` as the `physical_type` when the column is completely empty (contains only null values)?
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   ```
   
   
   ```python
   df = pd.DataFrame(data={"col1": [None, ], "col2": ["foo1", ]})
   ```
   
   
   ```python
   table = pa.Table.from_pandas(df)
   ```
   
   
   ```python
   pq.write_table(table, '/tmp/data.parquet')
   ```
   
   
   ```python
   legacy_dataset = pq.ParquetDataset('/tmp/data.parquet', use_legacy_dataset=True)
   ```
   
   
   ```python
   dataset = pq.ParquetDataset('/tmp/data.parquet', use_legacy_dataset=False)
   ```
   
   
   ```python
   legacy_dataset.schema
   ```
   
   
   
   
       <pyarrow._parquet.ParquetSchema object at 0x7efc1dc51a40>
       required group field_id=-1 schema {
         optional int32 field_id=-1 col1 (Null);
         optional binary field_id=-1 col2 (String);
       }
   
   
   
   
   ```python
   legacy_dataset.schema[0]
   ```
   
   
   
   
       <ParquetColumnSchema>
         name: col1
         path: col1
         max_definition_level: 1
         max_repetition_level: 0
         physical_type: INT32
         logical_type: Null
         converted_type (legacy): NONE
   
   
   
   
   ```python
   dataset.schema[0].type
   ```
   
   
   
   
       DataType(null)
   
   
   
   
   


