[ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930809#comment-16930809 ]
Krisztian Szucs commented on ARROW-5630:
----------------------------------------

We should start to write and regularly run hypothesis tests. It should be easy to write simple roundtrip tests for [tables|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/strategies.py#L247] covering [non-nullable|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/strategies.py#L103] cases.

> [Python][Parquet] Table of nested arrays doesn't round trip
> -----------------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
>
> pq.write_table(make_table(1000000), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-4-0f3266afa36c> in <module>
> ----> 1 pq.read_table('full.parquet')
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
>
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                    filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182
>     183     def open(self, path, mode='rb'):
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>    1012                 table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                    partitions=self.partitions,
> -> 1014                                    use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565
>     566         if len(self.partition_keys) > 0:
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>     212             columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215
>     216     def scan_contents(self, columns=None, batch_size=65536):
>
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
>
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)