Yevgeni Litvin created ARROW-6849:
-------------------------------------

             Summary: Can not read a list of items type 
                 Key: ARROW-6849
                 URL: https://issues.apache.org/jira/browse/ARROW-6849
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.0
            Reporter: Yevgeni Litvin
         Attachments: test_bad_parquet.tgz

A field of type list-of-ints cannot be read using the 
{{pyarrow.parquet.read_table}} function.

This happens only in pyarrow 0.15.0. After downgrading to 0.14.1, the issue is 
not observed.

pyspark version: 2.4.4

Minimal snippet to reproduce the issue:

 
{code:java}
import pyarrow.parquet as pq
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, Row, StructField, StructType,
)

output_url = '/tmp/test_bad_parquet'
spark = SparkSession.builder.getOrCreate()

# A non-nullable list column whose int32 elements are also non-nullable
schema = StructType([
    StructField('int_fixed_size_list', ArrayType(IntegerType(), False), False),
])
rows = [Row(int_fixed_size_list=[1, 2, 3])]

# write...parquet() returns None, so keep the DataFrame and the write separate
dataframe = spark.createDataFrame(rows, schema)
dataframe.write.mode('overwrite').parquet(output_url)

# Raises ArrowInvalid with pyarrow 0.15.0; succeeds with 0.14.1
pq.read_table(output_url)

{code}
I get an error:
{code:java}
Traceback (most recent call last):
  File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
    pq.read_table(output_url)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
    table = reader.read(**options)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>

Process finished with exit code 1

{code}
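
The two type strings in the error message appear to differ only in the name of the list's child field. A minimal sketch that makes the difference explicit by tokenizing both strings follows; the helper name diff_type_strings is hypothetical and is not part of pyarrow, and it only compares the printed type strings, not actual Arrow types.

```python
import re
from itertools import zip_longest


def diff_type_strings(left, right):
    """Return (left_token, right_token) pairs at positions where the strings differ.

    Hypothetical helper for illustration only; it compares the textual form of
    the types as printed in the error message.
    """
    def tokenize(text):
        # Split into identifiers/numbers and single punctuation characters.
        return re.findall(r"\w+|[^\w\s]", text)

    pairs = zip_longest(tokenize(left), tokenize(right))
    return [(a, b) for a, b in pairs if a != b]


# Type strings copied from the ArrowInvalid message above.
data_type = "list<item: int32 not null>"
schema_type = "list<element: int32 not null>"

print(diff_type_strings(data_type, schema_type))  # [('item', 'element')]
```

The element type (int32) and nullability (not null) match on both sides; only the child field name (item versus element) differs.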

A parquet store, as generated by the snippet, is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
