[jira] [Commented] (ARROW-16391) pd.read_parquet using filters consumes too much memory

Weston Pace (Jira) Thu, 28 Apr 2022 11:18:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529576#comment-17529576
 ]


Weston Pace commented on ARROW-16391:
-------------------------------------

Release candidates for 8.0.0 are currently being voted on.  Exact timing will 
depend on how many RCs we need to build but it should be within the next month 
or so.

> pd.read_parquet using filters consumes too much memory
> ------------------------------------------------------
>
>                 Key: ARROW-16391
>                 URL: https://issues.apache.org/jira/browse/ARROW-16391
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 7.0.0
>         Environment:  Hardware Overview:
>       Model Name: MacBook Pro
>       Model Identifier: MacBookPro12,1
>       Processor Name: Dual-Core Intel Core i5
>       Processor Speed: 2.7 GHz
>       Number of Processors: 1
>       Total Number of Cores: 2
>       L2 Cache (per Core): 256 KB
>       L3 Cache: 3 MB
>       Hyper-Threading Technology: Enabled
>       Memory: 8 GB
>  System Software Overview:
>       System Version: macOS 10.15.7 (19H1217)
>       Kernel Version: Darwin 19.6.0
>       Boot Volume: Macintosh HD
>       Boot Mode: Normal
>            Reporter: Luis E Pastrana
>            Priority: Major
>
> Hello!
>  
> I have found that pyarrow versions *>= 4.0.1* use more than *2x* memory (RSS) 
> when trying to read a parquet using file-level filters. Using the following 
> dataset:
> {code:java}
> import pandas as pd
> import numpy as np
> a = np.random.randint(1,50,(4_000_000,4))
> df = pd.DataFrame(a, columns=['A','B','C','D']).to_parquet("test.pq", 
> index=False) {code}
> and the reader script ({*}read_with_filters.py{*})
> {code:java}
> import pyarrow as pa
> import pandas as pd
> print(f"pyarrow version: {pa.__version__}")
> print(f"pandas version: {pd.__version__}")
> tmp = pd.read_parquet("test.pq", engine='pyarrow', use_legacy_dataset=False, 
> filters=[("B","=",10)])
> print(tmp.shape) {code}
> I get:
>  
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : 
> 1.32{code}
> *Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72 
> {code}
> *Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82 
> {code}
>  
> It is more evident when using a larger dataset. However, my personal computer 
> hangs when trying to read larger datasets using *4.0.1* and {*}7.0.0{*}. 
> (That should be a separate issue)
>  
> It is worth mentioning that you see a relative the same memory usage when I 
> removed the *filters* keyword.
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07 
> {code}
> {*}Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip){*}{*}{{*}}
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81 
> {code}
> {*}Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip){*}{*}{{*}}
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27 
> {code}
> Thank you,
> Luis



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (ARROW-16391) pd.read_parquet using filters consumes too much memory

Reply via email to