[ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930809#comment-16930809 ]
Krisztian Szucs commented on ARROW-5630:
----------------------------------------

We should start to write and regularly run hypothesis tests. It should be easy to write simple roundtrip tests for [tables|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/strategies.py#L247] covering [non-nullable|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/strategies.py#L103] cases.

> [Python][Parquet] Table of nested arrays doesn't round trip
> -----------------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
>
> pq.write_table(make_table(1000000), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-4-0f3266afa36c> in <module>
> ----> 1 pq.read_table('full.parquet')
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
>
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                    filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182
>     183     def open(self, path, mode='rb'):
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>    1012                 table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                    partitions=self.partitions,
> -> 1014                                    use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565
>     566         if len(self.partition_keys) > 0:
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>     212             columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215
>     216     def scan_contents(self, columns=None, batch_size=65536):
>
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
>
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)