[
https://issues.apache.org/jira/browse/ARROW-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081027#comment-17081027
]
Wes McKinney commented on ARROW-8385:
-------------------------------------
I haven't succeeded in reproducing this (including reading the crash.parquet
you attached) in a few different environments:
* pip freeze matching yours and Python 3.8.2 from python.org (so not using
Anaconda or conda-forge)
* pyarrow built from the master branch on Python 3.7 with full conda-forge stack
* Python 3.8 from conda-forge but using 0.16.0 wheel
I would guess there is something peculiar about your machine or your
environment. My Windows 10 machine is relatively recent (i7-8809G CPU). What
are you running?
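
For comparing environments, something like the following could collect the
relevant details in one place. This is only a minimal sketch: the function
name environment_report and the default package list are made up for
illustration, and platform.processor() may return an empty string on some
systems.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("pyarrow", "numpy", "pandas")):
    """Collect interpreter, OS, CPU, and package versions as a dict.

    The function name and the default package list are illustrative only.
    """
    report = {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        # May be an empty string on some systems.
        "processor": platform.processor(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting that output next to the pip freeze listing would make it easier to
spot differences in the OS build or CPU between machines.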
> [Python][Parquet] Crash on parquet.read_table on windows python 3.82
> --------------------------------------------------------------------
>
> Key: ARROW-8385
> URL: https://issues.apache.org/jira/browse/ARROW-8385
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0
> Environment: Windows 10
> python 3.8.2 pip 20.0.2
> pip freeze ->
> numpy==1.18.2
> pandas==1.0.3
> pyarrow==0.16.0
> python-dateutil==2.8.1
> pytz==2019.3
> six==1.14.0
> Reporter: Geoff Quested-Jones
> Assignee: Wes McKinney
> Priority: Major
> Attachments: crash.parquet
>
>
> When reading a parquet file with pyarrow, the program exits spontaneously
> with no exception thrown. This happens on Windows only: the same setup on
> Linux (Debian 10 in Docker) reads the same parquet file without issue.
> The following reproduces the crash in a Python 3.8.2 environment (listed in
> this issue), which is essentially pip install pandas and pyarrow.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def test_pandas_write_read():
>     # Round trip through the pandas API (uses pyarrow under the hood).
>     df_out = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
>     df_out.to_parquet("crash.parquet")
>     df_in = pd.read_parquet("crash.parquet")
>     print(df_in)
>
> def test_arrow_write_read():
>     # Round trip through the pyarrow API directly.
>     df = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
>     table_out = pa.Table.from_pandas(df)
>     pq.write_table(table_out, 'crash.parquet')
>     table_in = pq.read_table('crash.parquet')
>     print(table_in)
>
> if __name__ == "__main__":
>     test_pandas_write_read()
>     test_arrow_write_read()
> {code}
> The interpreter never reaches the print statements. It crashes somewhere in
> the call on line 252 of {{parquet.py}}; no error is thrown, the program just
> exits spontaneously.
> {code:python}
> self.reader.read_all(...
> {code}
> In contrast, running the same code and Python environment on Debian 10
> produces no error, even when reading the parquet files generated by the same
> code on Windows. The sha2sum values compare equal for the crash.parquet
> generated on Debian and on Windows, so the problem appears to be in the read.
> Attached is the crash.parquet file generated on my machine.
> Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on
> Windows.
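
Since the report describes an exit with no Python exception, one way to get
more information might be to run the read in a child process with
faulthandler enabled and inspect the exit code and stderr. The helper below
is a sketch under that assumption: run_isolated is a hypothetical name, not
part of pyarrow, and whether faulthandler manages to print a traceback
depends on how the process dies.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child process with faulthandler enabled.

    Returns (returncode, stdout, stderr). run_isolated is a hypothetical
    helper name used only for this illustration.
    """
    result = subprocess.run(
        # -X faulthandler asks the child to dump a traceback on fatal errors.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Assumes pyarrow is installed and crash.parquet is in the working directory.
    code, out, err = run_isolated(
        "import pyarrow.parquet as pq; print(pq.read_table('crash.parquet'))"
    )
    print("exit code:", code)
    print("stdout:", out)
    print("stderr:", err)
```

A nonzero exit code with a faulthandler dump in stderr would at least show
which Python frame was active when the process went down.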
--
This message was sent by Atlassian Jira
(v8.3.4#803005)