[ 
https://issues.apache.org/jira/browse/ARROW-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-6849.
----------------------------------------
    Resolution: Duplicate

> [Python] can not read a parquet store containing a list of integers 
> --------------------------------------------------------------------
>
>                 Key: ARROW-6849
>                 URL: https://issues.apache.org/jira/browse/ARROW-6849
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>            Reporter: Yevgeni Litvin
>            Priority: Major
>         Attachments: test_bad_parquet.tgz
>
>
> A field of type list-of-ints cannot be read using the 
> {{pyarrow.parquet.read_table}} function. The read also fails with other field 
> types (observed with strings, for example).
> This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is 
> not observed.
> pyspark version: 2.4.4
> Attachment: [^test_bad_parquet.tgz]
> Minimal snippet to reproduce the issue:
>  
> {code:java}
> import pyarrow.parquet as pq
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType
>
> output_url = '/tmp/test_bad_parquet'
> spark = SparkSession.builder.getOrCreate()
> # Single non-nullable column holding a list of non-nullable ints.
> schema = StructType([
>     StructField('int_fixed_size_list', ArrayType(IntegerType(), False), False),
> ])
> rows = [Row(int_fixed_size_list=[1, 2, 3])]
> # Write the store with Spark, then read it back with pyarrow.
> spark.createDataFrame(rows, schema).write.mode('overwrite').parquet(output_url)
> pq.read_table(output_url)  # raises ArrowInvalid on pyarrow 0.15.0
> {code}
> I get an error:
> {code:java}
> Traceback (most recent call last):
>   File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
>     pq.read_table(output_url)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
>     table = reader.read(**options)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>
>
> Process finished with exit code 1
> {code}
>  
> A parquet store, as generated by the snippet, is attached.
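
The two list types named in the ArrowInvalid message appear to differ only in the name of the child field ("item" versus "element"). The following is a minimal, standard-library-only sketch that pulls the child field name out of each type string to make that difference explicit. The helper `list_child_name` is hypothetical and written for illustration; it is not part of pyarrow, and it only parses the text of the message rather than inspecting the Parquet store itself.

```python
import re

# Type strings copied from the ArrowInvalid message in the report.
column_type = "list<item: int32 not null>"
schema_type = "list<element: int32 not null>"


def list_child_name(type_str):
    """Return (child_name, child_type) from a 'list<name: type>' string.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    match = re.fullmatch(r"list<(\w+): (.+)>", type_str)
    if match is None:
        raise ValueError(f"not a list type string: {type_str!r}")
    return match.group(1), match.group(2)


column_name, column_child = list_child_name(column_type)
schema_name, schema_child = list_child_name(schema_type)

# Compare the child value types and the child field names separately.
same_child_type = column_child == schema_child
same_child_name = column_name == schema_name

print("child names:", column_name, "vs", schema_name)
print("same child type:", same_child_type)
print("same child name:", same_child_name)
```

Under this reading, the value type (int32, not null) matches on both sides, so the mismatch reported by the reader comes down to the child field naming.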



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
