mishbahr opened a new issue #11967:
URL: https://github.com/apache/arrow/issues/11967
I'm writing a DataFrame to the binary Parquet format, where one or more
object columns are entirely null.
If I then load the Parquet dataset with `use_legacy_dataset=False`:
```python
parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False,
**kwargs)
type(parquet_dataset)
pyarrow.parquet._ParquetDatasetV2
```
It returns a `_ParquetDatasetV2` instance, and when I check the schema, it is a `pyarrow.lib.Schema`:
```python
type(parquet_dataset.schema)
pyarrow.lib.Schema
```
If I load the same file but with `use_legacy_dataset=True`:
```python
parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True,
**kwargs)
```
The schema for the file is an instance of `ParquetSchema`:
```python
type(parquet_dataset2.schema)
pyarrow._parquet.ParquetSchema
```
This is as I would expect, and I'm aware that I can get the Arrow schema
like this:
```python
arrow_schema = parquet_dataset2.schema.to_arrow_schema()
type(arrow_schema)
pyarrow.lib.Schema
```
i.e. the same type as when I use `use_legacy_dataset=False`.
For an instance of `ParquetSchema`, I can get details of any column, e.g.:
```python
parquet_dataset2.schema[13]
<ParquetColumnSchema>
name: col13
path: col13
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
```
Here the `physical_type` for this column is INT96:
```python
parquet_dataset2.schema[13].physical_type
'INT96'
```
For an instance of `pyarrow.lib.Schema`, if I get the data type for the
same column, I get:
```python
parquet_dataset.schema.field("col13").type
DataType(null)
```
i.e. with no information about what the data type is supposed to be.
This information is available in the Parquet file, but how do I access it?
Is there a way to convert an instance of `pyarrow.lib.Schema` to a
`pyarrow._parquet.ParquetSchema`?