[jira] [Updated] (ARROW-12201) [C++] [Parquet] Writing uint32 does not preserve parquet's LogicalType

Jira Sun, 04 Apr 2021 23:17:02 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jorge Leitão updated ARROW-12201:
---------------------------------
    Description: 
When writing a `uint32` column, Parquet's logical type is not written, which
limits interoperability with other engines.

Minimal Python example:

```
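import pyarrow as pa
import pyarrow.parquet as pq  # must be imported explicitly

# Map the column name to its values (a dict, not a set literal)
data = {"uint32": [1, None, 0]}
schema = pa.schema([pa.field('uint32', pa.uint32())])

t = pa.table(data, schema=schema)
pq.write_table(t, "bla.parquet")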
```
 
Inspecting it with spark:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("bla.parquet")
print(df.select("uint32").schema)
```

shows `StructType(List(StructField(uint32,LongType,true)))`. "LongType"
indicates that Spark interprets the field as a 64-bit integer rather than an
unsigned 32-bit one. Further inspection of the Parquet metadata shows that
neither convertedType nor logicalType is set. Note that this is independent of
the Arrow-specific schema that is also written in the metadata.

  was:
When writing a `uint32` column, (parquet's) logical type is not written, 
limiting interoperability with other engines.

Minimal Python

```
import pyarrow as pa

data = {"uint32", [1, None, 0]}
schema = pa.schema([pa.field('uint32', pa.uint32())])

t = pa.table(data, schema=schema)
pa.parquet.write_table(t, "bla.parquet")
```
 
Inspecting it with spark:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("bla.parquet")
print(df.select("uint32").schema)
```

shows `StructType(List(StructField(uint32,LongType,true)))`. "LongType" 
indicates that the field is interpreted as a 64 bit integer. Further inspection 
would determine that both convertedType and logicalType are not being set. Note 
that this is independent of the arrow-specific schema written in the metadata.


> [C++] [Parquet] Writing uint32 does not preserve parquet's LogicalType
> ----------------------------------------------------------------------
>
>                 Key: ARROW-12201
>                 URL: https://issues.apache.org/jira/browse/ARROW-12201
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 3.0.0
>            Reporter: Jorge Leitão
>            Priority: Minor
>
> When writing a `uint32` column, Parquet's logical type is not written, which
> limits interoperability with other engines.
> Minimal Python example:
> ```
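> import pyarrow as pa
> import pyarrow.parquet as pq  # must be imported explicitly
> # Map the column name to its values (a dict, not a set literal)
> data = {"uint32": [1, None, 0]}
> schema = pa.schema([pa.field('uint32', pa.uint32())])
> t = pa.table(data, schema=schema)
> pq.write_table(t, "bla.parquet")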
> ```
>  
> Inspecting it with spark:
> ```
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> df = spark.read.parquet("bla.parquet")
> print(df.select("uint32").schema)
> ```
> shows `StructType(List(StructField(uint32,LongType,true)))`. "LongType"
> indicates that Spark interprets the field as a 64-bit integer rather than an
> unsigned 32-bit one. Further inspection of the Parquet metadata shows that
> neither convertedType nor logicalType is set. Note that this is independent
> of the Arrow-specific schema that is also written in the metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
