mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995697816
Please ignore my initial code sample, where I was using a real parquet file as
the source.
This is a fresh example using `df = pd.DataFrame(data={"col1": [None, ],
"col2": ["foo1", ]})` as the starting point.
Where does pyarrow get `INT32` as the "physical_type" when the column is
completely empty (only null values)?
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
```
```python
df = pd.DataFrame(data={"col1": [None, ], "col2": ["foo1", ]})
```
```python
table = pa.Table.from_pandas(df)
```
```python
pq.write_table(table, '/tmp/data.parquet')
```
```python
legacy_dataset = pq.ParquetDataset('/tmp/data.parquet',
                                   use_legacy_dataset=True)
```
```python
dataset = pq.ParquetDataset('/tmp/data.parquet', use_legacy_dataset=False)
```
```python
legacy_dataset.schema
```
<pyarrow._parquet.ParquetSchema object at 0x7efc1dc51a40>
required group field_id=-1 schema {
  optional int32 field_id=-1 col1 (Null);
  optional binary field_id=-1 col2 (String);
}
```python
legacy_dataset.schema[0]
```
<ParquetColumnSchema>
  name: col1
  path: col1
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT32
  logical_type: Null
  converted_type (legacy): NONE
```python
dataset.schema[0].type
```
DataType(null)