[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631106#comment-16631106 ]
Wes McKinney commented on ARROW-3238:
-------------------------------------
It looks like there's an issue reading the last data page in fastparquet; see
this gist. I created a file using the minimum value length that triggers the
bug (32755):
https://gist.github.com/wesm/56c66524b1237a08a528c6eb8d01e44b
The last seek before the exception is to position 327853, but the file is only
459938 bytes long. Since 5 * 32755 = 163775, there are not enough bytes left in
the file after that position to decode the plain value page. The problem
appears to be in the Thrift decoding code, which reaches into internal APIs of
the Python Thrift implementation:
{code}
--> 32     file_obj.seek(starting_pos + blocks * bufsize + buffer_pos)
    33     return obj
    34
ipdb> p starting_pos
229557
ipdb> p blocks
1
ipdb> p bufsize
65536
ipdb> p buffer_pos
32760
ipdb> p starting_pos + blocks * bufsize + buffer_pos
327853
ipdb> p bt.cstringio_buf.tell()
32760
{code}
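To make the failure mode concrete, here is a minimal sketch (not fastparquet's
or Thrift's actual code) of the position arithmetic in the frame above, using
the values observed in the ipdb session. The assumption it encodes is that
every buffer refill read exactly {{bufsize}} bytes:
{code}
# Sketch: reconstructing an absolute file position from buffered-transport
# internals.  The arithmetic only holds if every refill of the internal
# buffer read exactly `bufsize` bytes; a short read (or a large value read
# around the buffer) makes the reconstructed position overshoot.

bufsize = 65536        # buffer size observed in the ipdb session
starting_pos = 229557  # file offset when Thrift decoding began (observed)
blocks = 1             # number of buffer refills counted (observed)
buffer_pos = 32760     # offset inside the current buffer (observed)

reconstructed = starting_pos + blocks * bufsize + buffer_pos
print(reconstructed)   # 327853 -- the bad seek target from the session

# If the refill actually returned n < bufsize bytes, the true position would
# be starting_pos + n + buffer_pos, strictly less than the value above.
{code}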
Whether the bug is in fastparquet or in Thrift's Python implementation is
anyone's guess, but the problem appears to be caused by the metadata structure
for the data page being large: the statistics hold twice the value size, i.e.
2 * 32755 = 65510 bytes. Including the Thrift field tags, the serialized
message surely exceeds the 65536-byte buffer size, so I'm guessing that's the
problem.
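The headroom is tiny; a quick back-of-the-envelope check (plain arithmetic,
assuming only that the min and max values are stored verbatim in the
statistics):
{code}
value_len = 32755             # minimum value length that triggers the bug
stats_bytes = 2 * value_len   # min + max stored verbatim in the statistics
bufsize = 65536               # Thrift transport buffer size observed above

print(stats_bytes)            # 65510
print(bufsize - stats_bytes)  # 26 bytes of headroom for the rest of the
                              # PageHeader -- any realistic amount of Thrift
                              # field tags and lengths overflows the buffer
{code}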
Are other Parquet implementations able to read this file? If they are not, I
will be happy to keep investigating.
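For anyone who wants to run that cross-check, a minimal sketch (it assumes the
{{test.parquet}} file produced by the repro below; the parquet-tools command
comes from parquet-mr and its invocation may vary by version):
{code}
import pyarrow.parquet as pq

# pyarrow should be able to read back its own output:
print(pq.read_table('test.parquet').to_pandas().head())

# An independent implementation can be tried from a shell, e.g. parquet-mr:
#   parquet-tools cat test.parquet
{code}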
> [Python] Can't read pyarrow string columns in fastparquet
> ---------------------------------------------------------
>
> Key: ARROW-3238
> URL: https://issues.apache.org/jira/browse/ARROW-3238
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Theo Walker
> Priority: Major
> Labels: parquet
>
> Writing really long strings from pyarrow causes an exception when reading with fastparquet.
> {code:java}
> Traceback (most recent call last):
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
>     read_fastparquet()
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
>     dff = pf.to_pandas(['A'])
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
>     index=index, assign=parts)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
>     scheme=self.file_scheme)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
>     cats, selfmade, assign=assign)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
>     catdef=out.get(name+'-catdef', None))
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
>     skip_nulls, selfmade=selfmade)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
>     raw_bytes = _read_page(f, header, metadata)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
>     page_header.uncompressed_page_size)
> AssertionError: found 175532 raw bytes (expected 200026){code}
> If written with compression, it reports compression errors instead:
> {code:java}
> SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
> GZIP: zlib.error: Error -3 while decompressing data: incorrect header check{code}
>
>
> Minimal code to reproduce:
> {code:java}
> import os
> import pandas as pd
> import pyarrow
> import pyarrow.parquet as arrow_pq
> from fastparquet import ParquetFile
>
> # data to generate
> ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
> N_ROWS = 10
>
> # file write params
> ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
> FILENAME = 'test.parquet'
>
> def write_arrow():
>     df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
>     if os.path.isfile(FILENAME):
>         os.remove(FILENAME)
>     arrow_table = pyarrow.Table.from_pandas(df)
>     arrow_pq.write_table(arrow_table,
>                          FILENAME,
>                          use_dictionary=False,
>                          compression='NONE',
>                          row_group_size=ROW_GROUP_SIZE)
>
> def read_arrow():
>     print "arrow:"
>     table2 = arrow_pq.read_table(FILENAME)
>     print table2.to_pandas().head()
>
> def read_fastparquet():
>     print "fastparquet:"
>     pf = ParquetFile(FILENAME)
>     dff = pf.to_pandas(['A'])
>     print dff.head()
>
> write_arrow()
> read_arrow()
> read_fastparquet()
> {code}
> Versions:
> {code:java}
> fastparquet==0.1.6
> pyarrow==0.10.0
> pandas==0.22.0
> sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code}
> Also opened issue here: https://github.com/dask/fastparquet/issues/375
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)