[ https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765767#comment-15765767 ]

Wes McKinney commented on ARROW-434:
------------------------------------

We hadn't yet dealt with binary (or non-UTF8 string) data, so there were a 
couple of things to do there. ARROW-374 
(https://github.com/apache/arrow/pull/249) and PARQUET-812 
(https://github.com/apache/parquet-cpp/pull/206) are in code review, so it will 
take a day or so for updated packages to hit conda-forge, but with those 
patches applied locally I have:

{code}
In [5]: data = open('/home/wesm/Downloads/nation.impala.parquet', 'rb').read()

In [6]: import io

In [7]: buf = io.BytesIO(data)

In [8]: import pyarrow.parquet as pq

In [9]: table = pq.read_table(buf)

In [10]: table.schema
Out[10]: 
n_nationkey: int32
n_name: binary
n_regionkey: int32
n_comment: binary

In [11]: table.to_pandas()
Out[11]: 
    n_nationkey          n_name  n_regionkey  \
0             0         ALGERIA            0   
1             1       ARGENTINA            1   
2             2          BRAZIL            1   
3             3          CANADA            1   
4             4           EGYPT            4   
5             5        ETHIOPIA            0   
6             6          FRANCE            3   
7             7         GERMANY            3   
8             8           INDIA            2   
9             9       INDONESIA            2   
10           10            IRAN            4   
11           11            IRAQ            4   
12           12           JAPAN            2   
13           13          JORDAN            4   
14           14           KENYA            0   
15           15         MOROCCO            0   
16           16      MOZAMBIQUE            0   
17           17            PERU            1   
18           18           CHINA            2   
19           19         ROMANIA            3   
20           20    SAUDI ARABIA            4   
21           21         VIETNAM            2   
22           22          RUSSIA            3   
23           23  UNITED KINGDOM            3   
24           24   UNITED STATES            1   

                                            n_comment  
0    haggle. carefully final deposits detect slyly...  
1   al foxes promise slyly according to the regula...  
2   y alongside of the pending deposits. carefully...  
3   eas hang ironic, silent packages. slyly regula...  
4   y above the carefully unusual theodolites. fin...  
5                     ven packages wake quickly. regu  
6              refully final requests. regular, ironi  
7   l platelets. regular accounts x-ray: unusual, ...  
8   ss excuses cajole slyly across the packages. d...  
9    slyly express asymptotes. regular deposits ha...  
10  efully alongside of the slyly final dependenci...  
11  nic deposits boost atop the quickly final requ...  
12               ously. final, express gifts cajole a  
13  ic deposits are blithely about the carefully r...  
14   pending excuses haggle furiously deposits. pe...  
15  rns. blithely bold courts among the closely re...  
16      s. ironic, unusual asymptotes wake blithely r  
17  platelets. blithely pending dependencies use f...  
18  c dependencies. furiously express notornis sle...  
19  ular asymptotes are about the furious multipli...  
20  ts. silent requests haggle. closely express pa...  
21     hely enticingly express accounts. even, final   
22   requests against the platelets use never acco...  
23  eans boost carefully special requests. account...  
24  y final packages. slow foxes cajole quickly. q... 
{code}
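
The schema above shows n_name and n_comment coming back as binary rather than 
utf8, so to_pandas() hands you Python bytes objects in those columns. As a 
rough sketch, assuming the values really are UTF-8 text, they can be decoded 
after the fact:

{code:python}
import io
import pyarrow.parquet as pq

# Same TPC-H nation file as above; adjust the path to your local copy.
with open('nation.impala.parquet', 'rb') as f:
    buf = io.BytesIO(f.read())

table = pq.read_table(buf)
df = table.to_pandas()

# Binary columns arrive as bytes in pandas; decode the text-like ones to str.
for col in ('n_name', 'n_comment'):
    df[col] = df[col].str.decode('utf-8')
{code}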

> Segfaults and encoding issues in Python Parquet reads
> -----------------------------------------------------
>
>                 Key: ARROW-434
>                 URL: https://issues.apache.org/jira/browse/ARROW-434
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Ubuntu, Python 3.5, installed pyarrow from conda-forge
>            Reporter: Matthew Rocklin
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet, python
>
> I've conda-installed pyarrow and am trying to read data from the 
> parquet-compatibility project.  I haven't explicitly built parquet-cpp or 
> anything and may or may not have old versions lying around, so please take 
> this issue with a grain of salt:
> {code:python}
> In [1]: import pyarrow.parquet
> In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
> ---------------------------------------------------------------------------
> ArrowException                            Traceback (most recent call last)
> <ipython-input-2-5d966681a384> in <module>()
> ----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
>  in pyarrow.parquet.read_table 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
>  in pyarrow.parquet.ParquetReader.read_all 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx
>  in pyarrow.error.check_status 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()
> ArrowException: NotImplemented: list<: uint8>
> {code}
> Additionally I tried to read data from a Python file-like object pointing to 
> data on S3.  Let me know if you'd prefer a separate issue.
> {code:python}
> In [1]: import s3fs
> In [2]: fs = s3fs.S3FileSystem()
> In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')
> In [4]: f.read(100)
> Out[4]: 
> b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> Here is a more reproducible version:
> {code:python}
> In [1]: with open('nation.plain.parquet', 'rb') as f:
>    ...:     data = f.read()
>    ...:     
> In [2]: from io import BytesIO
> In [3]: f = BytesIO(data)
> In [4]: f.seek(0)
> Out[4]: 0
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> I was, however, very pleased with the round-trip functionality within this 
> project.



