[ https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968488#comment-16968488 ]
Eric Kisslinger commented on ARROW-7059:
----------------------------------------

Thanks for the suggestion. I was unfamiliar with perf. Here are call graphs of the 10,000-column read_all() test using 0.14.1 and 0.15.1. The big difference seems to be in malloc-related calls: 0.14.1 spends 0.51% of its time in {{new}} and 0.75% in {{_int_free}}, whereas 0.15.1 spends 39.51% and 25.40%, respectively.

*0.14.1*

!image-2019-11-06-08-19-11-662.png!

*0.15.1*

!image-2019-11-06-08-25-05-885.png!

> [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-7059
>                 URL: https://issues.apache.org/jira/browse/ARROW-7059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 79
> Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>            Reporter: Eric Kisslinger
>            Priority: Major
>              Labels: performance
>         Attachments: image-2019-11-06-08-18-42-783.png, image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, image-2019-11-06-08-25-05-885.png
>
>
> Reading Parquet files with a large number of columns still seems to be very
> slow in 0.15.1 compared to 0.14.1. I am using the same test as in
> https://issues.apache.org/jira/browse/ARROW-6876, except that I set
> {{use_threads=False}} to make for an apples-to-apples comparison with respect
> to the number of CPUs.
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(10000)})}}
> {{pq.write_table(table, "test_wide.parquet")}}
> {{res = pq.read_table("test_wide.parquet")}}
> {{print(pa.__version__)}}
> use_threads=False
> {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
>
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
>
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
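
For reference, the size of the regression implied by the two wall times quoted in the report can be worked out with a few lines of Python. This is only a sketch: the helper names {{wall_seconds}} and {{slowdown}} are hypothetical (they are not part of pyarrow or IPython), and the parser assumes the simple single-unit form of the {{%time}} output shown in the report.

```python
import re

# Wall-time lines copied verbatim from the report.
REPORTED = {
    "0.14.1": "Wall time: 525 ms",
    "0.15.1": "Wall time: 9.93 s",
}

# Unit suffixes mapped to seconds (assumption: only single-unit values
# such as "525 ms" or "9.93 s" need to be handled).
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}


def wall_seconds(line):
    """Parse a line such as 'Wall time: 525 ms' into seconds."""
    match = re.search(r"Wall time:\s*([\d.]+)\s*(\S+)", line)
    if match is None:
        raise ValueError(f"not a wall-time line: {line!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


def slowdown(old_line, new_line):
    """Return how many times slower the new timing is than the old one."""
    return wall_seconds(new_line) / wall_seconds(old_line)


if __name__ == "__main__":
    ratio = slowdown(REPORTED["0.14.1"], REPORTED["0.15.1"])
    print(f"0.15.1 is about {ratio:.1f}x slower than 0.14.1")
```

With the reported numbers this gives a ratio of roughly 19x for the single-threaded read.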