[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

Joris Van den Bossche (Jira) Thu, 28 Nov 2019 01:39:59 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-7059:
-----------------------------------------
    Description: 
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I am using the same test as in ARROW-6876, except 
that I set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to the number of CPUs.


{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)
{code}
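
The {{%time}} magic above only works inside IPython or Jupyter. As a rough sketch for reproducing the measurement from a plain Python script, the same read can be timed with {{time.perf_counter}}. The helper names {{best_of}} and {{bench_wide_read}} and the default arguments are hypothetical; the pyarrow calls are the same ones used in the snippet above, and the sketch assumes numpy and pyarrow are installed.

```python
import time


def best_of(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def bench_wide_read(path="test_wide.parquet", n_cols=10000, n_rows=10):
    """Write a wide table to `path`, then time a single-threaded read of it."""
    # Imported here so best_of() stays usable without pyarrow installed.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Same wide table as in the report: many columns, very few rows.
    table = pa.table({"c" + str(i): np.random.randn(n_rows) for i in range(n_cols)})
    pq.write_table(table, path)
    seconds = best_of(lambda: pq.read_table(path, use_threads=False))
    return pa.__version__, seconds
```

Calling {{bench_wide_read()}} once under each pyarrow version and printing the returned pair should give numbers comparable to the ones below. Note that this sketch reports the fastest of several runs, whereas {{%time}} measures a single run.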

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}
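
For scale, the two wall times above can be turned into a slowdown factor and a rough per-column cost. This is only arithmetic on the reported numbers (525 ms, 9.93 s, 10,000 columns); the variable names are illustrative.

```python
# Wall times reported above, in seconds.
wall_0_14_1 = 0.525
wall_0_15_1 = 9.93
n_cols = 10000  # columns in test_wide.parquet

slowdown = wall_0_15_1 / wall_0_14_1
per_col_ms_0_14_1 = wall_0_14_1 / n_cols * 1000
per_col_ms_0_15_1 = wall_0_15_1 / n_cols * 1000

print(f"slowdown: {slowdown:.1f}x")
print(f"per column: {per_col_ms_0_14_1:.4f} ms vs {per_col_ms_0_15_1:.4f} ms")
```

That is a slowdown of about 19x, or roughly 1 ms per column in 0.15.1 against about 0.05 ms in 0.14.1.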

  was:
Reading Parquet files with large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I using the same test used in ARROW-6876 except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(10000)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}
use_threads=False
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}
**

*In 0.15.1 with* *use_threads=False**:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}
{{}} 


> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-7059
>                 URL: https://issues.apache.org/jira/browse/ARROW-7059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                32
> On-line CPU(s) list:   0-31
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>            Reporter: Eric Kisslinger
>            Priority: Major
>              Labels: parquet, performance
>             Fix For: 1.0.0
>
>         Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I am using the same test as in ARROW-6876, 
> except that I set {{use_threads=False}} to make for an apples-to-apples 
> comparison with respect to the number of CPUs.
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
> pq.write_table(table, "test_wide.parquet")
> res = pq.read_table("test_wide.parquet")
> print(pa.__version__)
> %time res = pq.read_table("test_wide.parquet", use_threads=False)
> {code}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



