[
https://issues.apache.org/jira/browse/ARROW-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183268#comment-17183268
]
Joris Van den Bossche commented on ARROW-9827:
----------------------------------------------
As I suspected, this is due to the parsing of the metadata, and especially the
column statistics. In the Dataset API, we parse all metadata by default,
including statistics, even when no filter is specified.
With statistics parsing disabled, the file with 40,000 columns reads in
2.5 seconds (I will open a PR with that change).
Furthermore, with the released version of pyarrow, using ParquetFile is
currently the fastest option:
{code}
In [7]: %time table = pq.ParquetFile("test.parquet").read()
CPU times: user 7.89 s, sys: 253 ms, total: 8.14 s
Wall time: 4.5 s
{code}
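For reference, a minimal timing sketch of the two read paths (my own, not from
the report; it assumes the test.parquet file produced by the repro script
quoted below). In pyarrow 1.0, pq.read_table goes through the new Dataset API,
while ParquetFile reads the single file directly:
{code:python}
import time

import pyarrow.parquet as pq

def timed(label, fn):
    # crude wall-clock timing, analogous to %time in the session above
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Dataset-API path: parses all file metadata, including column statistics
timed("pq.read_table", lambda: pq.read_table("test.parquet"))
# Direct single-file reader: the currently fastest option mentioned above
timed("pq.ParquetFile(...).read()", lambda: pq.ParquetFile("test.parquet").read())
{code}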
Additional sidenote: the Parquet file format is generally not well suited for
such wide tables because of its column-heavy metadata. For example, a Parquet
file with the same number of values but in long format (40,000 rows x 10
columns) reads in less than 10 ms instead of multiple seconds.
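A sketch of that wide-vs-long comparison (the file names and zero-filled data
are my own, mirroring the repro below): both files hold the same 400,000
values, only the shape differs.
{code:python}
import time

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# 10 rows x 40,000 columns (wide) vs 40,000 rows x 10 columns (long)
wide = pd.DataFrame(np.zeros((10, 40_000)),
                    columns=[f"A_{i}" for i in range(40_000)])
long_df = pd.DataFrame(np.zeros((40_000, 10)),
                       columns=[f"A_{i}" for i in range(10)])
wide.to_parquet("wide.parquet")
long_df.to_parquet("long.parquet")

for path in ("wide.parquet", "long.parquet"):
    start = time.perf_counter()
    pq.read_table(path)
    print(f"{path}: read in {time.perf_counter() - start:.3f} s")
{code}
The gap comes from the per-column metadata (schema, statistics), which grows
with the number of columns rather than the number of values.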
> [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X
> ---------------------------------------------------------------------------
>
> Key: ARROW-9827
> URL: https://issues.apache.org/jira/browse/ARROW-9827
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Reporter: Kyle Beauchamp
> Priority: Major
>
> I recently updated pyarrow from 0.17.1 to 1.0.0 and I'm encountering a
> serious bug where wide DataFrames fail during pandas.read_parquet. Small
> parquet files (m=10000) read correctly, medium files (m=40000) fail with a
> "Bus error: 10", and large files (m=100000) hang completely. This is with
> Python 3.8.5, pandas 1.0.5, pyarrow 1.0.0, and OSX 10.14.
> The driver code and output are below:
> {code:python}
> import pandas as pd
> import numpy as np
> import sys
> filename = "test.parquet"
> n = 10
> m = int(sys.argv[1])
> print(m)
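> # n rows x m columns of zeros: few rows, very many columns (a "wide" table)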
> x = np.zeros((n, m))
> x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)])
> x.to_parquet(filename)
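> # reading the wide file back is the step that fails on pyarrow 1.0.0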
> y = pd.read_parquet(filename, engine='pyarrow')
> {code}
> {code:bash}
> time python test_pyarrow.py 10000
> real    0m4.018s
> user    0m5.286s
> sys     0m0.514s
> time python test_pyarrow.py 40000
> 40000
> Bus error: 10
> {code}
>
> In a pyarrow 0.17.1 environment, the 40,000-column case completes in 8 seconds.
> This was cross-posted on the pandas tracker as well:
> [https://github.com/pandas-dev/pandas/issues/35846]