mishbahr opened a new issue #11967:
URL: https://github.com/apache/arrow/issues/11967


   I'm writing a DataFrame to binary Parquet format in which one or more object 
columns are entirely null.
   
   If I then load the parquet dataset with `use_legacy_dataset=False`
   
   ```python
   parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False, **kwargs)
   type(parquet_dataset)
   pyarrow.parquet._ParquetDatasetV2
   ```
   
   It returns a `_ParquetDatasetV2` instance, and its schema is a `pyarrow.lib.Schema`:
   
   ```python
   type(parquet_dataset.schema) 
   pyarrow.lib.Schema
   ```
   
   If I load the same file but with `use_legacy_dataset=True`
   
   ```python
   parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True, **kwargs)
   ```
   
   The schema for the file is an instance of `ParquetSchema`
   
   ```python
   type(parquet_dataset2.schema)
   pyarrow._parquet.ParquetSchema
   ```
   
   This is as I would expect, and I'm aware that I can get the Arrow schema 
like this:
   
   ```python
   arrow_schema = parquet_dataset2.schema.to_arrow_schema()
   type(arrow_schema)
   pyarrow.lib.Schema
   ```
   
   i.e. the same type as when I use `use_legacy_dataset=False`.
   
   For an instance of `ParquetSchema`, I can get the details of any column, e.g.:
   
   
   ```python
   parquet_dataset2.schema[13]
   
   <ParquetColumnSchema>
     name: col13
     path: col13
     max_definition_level: 1
     max_repetition_level: 0
     physical_type: INT96
     logical_type: None
     converted_type (legacy): NONE
   ```
   
   Here the "physical_type" for this column is INT96.
   
   ```python
   parquet_dataset2.schema[13].physical_type
   'INT32'
   ```
   
   For an instance of `pyarrow.lib.Schema`, the data type for the same column 
is:
   
   ```python
   parquet_dataset.schema.field("col13").type
   DataType(null)
   ```
   
   i.e. there is no information about what the data type is supposed to be.
   
   This information is available in the Parquet file. But how do I access it?
   
   Is there a way to convert an instance of `pyarrow.lib.Schema` to 
`pyarrow._parquet.ParquetSchema`?
   
   



