[ 
https://issues.apache.org/jira/browse/ARROW-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6844:
----------------------------------
    Labels: parquet pull-request-available  (was: parquet)

> [C++][Parquet][Python] List<scalar type> columns read broken with 0.15.0
> ------------------------------------------------------------------------
>
>                 Key: ARROW-6844
>                 URL: https://issues.apache.org/jira/browse/ARROW-6844
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.0
>            Reporter: Benoit Rostykus
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 1.0.0, 0.15.1
>
>         Attachments: dbg_sample.gz.parquet, dbg_sample2.gz.parquet
>
>
> Columns of type {{array<primitive type>}} (such as `array<int32>`, 
> `array<int64>`...) are not readable anymore using {{pyarrow == 0.15.0}} (but 
> were with {{pyarrow == 0.14.1}}) when the original writer of the parquet file 
> is {{parquet-mr 1.9.1}}.
> {code}
> import pyarrow.parquet as pq
> pf = pq.ParquetFile('sample.gz.parquet')
> print(pf.read(columns=['profile_ids']))
> {code}
> with 0.14.1:
> {code}
> pyarrow.Table
> profile_ids: list<element: int64>
>  child 0, element: int64
> ...
> {code}
> with 0.15.0:
> {code}
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1131, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int64> is inconsistent with schema list<element: int64>
> {code}
> I've tested parquet files coming from multiple tables (with various schemas) 
> created with `parquet-mr`, and couldn't read any `array<primitive type>` column 
> in any of them anymore.
>  
> I _think_ the bug was introduced with 
> [this commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5].
> I think the root of the issue comes from the fact that `parquet-mr` writes 
> the inner struct name as `"element"` by default (see 
> [here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]),
>  whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example 
> [this test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]).
>  The round-trip tests, which write and read with pyarrow only, obviously 
> won't catch this.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
