Writing very long strings with pyarrow produces a Parquet file that fastparquet fails to read, raising an AssertionError:

```
Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)
```

If the file is written with compression instead (set `compression` to `'SNAPPY'` or `'GZIP'` in the repro below), fastparquet reports decompression errors rather than the assertion:

SNAPPY: `snappy.UncompressError: Error while decompressing: invalid input`

GZIP: `zlib.error: Error -3 while decompressing data: incorrect header check`
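
For reference, a self-contained sketch that triggers both codec errors by rewriting the same long-string table with each compression setting. It assumes the same data layout as the full repro below; `codec_test.parquet` is a hypothetical filename.

```
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# Same long-string layout as the repro below: 10 rows of 40000-char strings.
df = pd.DataFrame({'A': ['A' * 40000 for _ in range(10)]})
table = pyarrow.Table.from_pandas(df)

for codec in ('SNAPPY', 'GZIP'):
    arrow_pq.write_table(table, 'codec_test.parquet',
                         use_dictionary=False,
                         compression=codec,
                         row_group_size=5)
    try:
        ParquetFile('codec_test.parquet').to_pandas(['A'])
    except Exception as e:
        print '%s -> %r' % (codec, e)
```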

Minimal code to reproduce:
```
import os
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # lower values eliminate the exception, but then strange data (e.g. Nones) is read back
FILENAME = 'test.parquet'

def write_arrow():
    df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         FILENAME,
                         use_dictionary=False,
                         compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)


def read_arrow():
    print "arrow:"
    table2 = arrow_pq.read_table(FILENAME)
    print table2.to_pandas().head()


def read_fastparquet():
    print "fastparquet:"
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print dff.head()


write_arrow()
read_arrow()
read_fastparquet()
```
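
The "below 32750ish" comment above suggests a size threshold near 32 KiB. A sketch (reusing the definitions from the repro, and assuming failures are monotone in string length, which matches that observation but isn't guaranteed) to narrow down the smallest failing `ROW_LENGTH`:

```
def fails_at(row_length):
    # Rewrite the file with the given string length, then attempt a fastparquet read.
    df = pd.DataFrame({'A': ['A' * row_length for _ in range(N_ROWS)]})
    arrow_pq.write_table(pyarrow.Table.from_pandas(df), FILENAME,
                         use_dictionary=False, compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)
    try:
        ParquetFile(FILENAME).to_pandas(['A'])
        return False
    except Exception:
        return True

# Binary search between a known-good length (1000, well under the threshold)
# and the known-bad ROW_LENGTH from the repro.
lo, hi = 1000, ROW_LENGTH
while lo + 1 < hi:
    mid = (lo + hi) // 2
    if fails_at(mid):
        hi = mid
    else:
        lo = mid
print 'smallest failing ROW_LENGTH: %d' % hi
```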
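
Since pyarrow itself reads the file back fine (`read_arrow` above), one way to check whether the file or the reader is at fault is to round-trip the same frame through fastparquet's own writer. A minimal sketch, assuming `fastparquet.write`'s default file scheme; `fp_test.parquet` is a hypothetical filename, and `row_group_offsets=[0, ROW_GROUP_SIZE]` mimics the 5-row row groups used above:

```
from fastparquet import write as fp_write

df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
fp_write('fp_test.parquet', df, row_group_offsets=[0, ROW_GROUP_SIZE])
print ParquetFile('fp_test.parquet').to_pandas(['A']).head()
```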


Versions:
`fastparquet==0.1.6`
`pyarrow==0.10.0`
`pandas==0.22.0`
`sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May  1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'`

Also opened issue here: https://github.com/dask/fastparquet/issues/375
