[ https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254007#comment-17254007 ]

Ashish Gupta commented on ARROW-9974:
-------------------------------------

This is a dedicated physical server.

sysctl vm.overcommit_memory
vm.overcommit_memory = 2
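
For context: vm.overcommit_memory = 2 puts the kernel in strict accounting mode, so any allocation that would push Committed_AS past CommitLimit is refused, no matter how much physical memory is actually free. A quick sanity check (a sketch that assumes the default vm.overcommit_ratio of 50) reproduces the CommitLimit value in the /proc/meminfo dump below:

{code}
# CommitLimit = SwapTotal + (vm.overcommit_ratio / 100) * MemTotal
# Values in kB, taken from /proc/meminfo below; the ratio of 50 is the
# kernel default and an assumption here -- check sysctl vm.overcommit_ratio.
mem_total = 263518320
swap_total = 4194300
overcommit_ratio = 50

commit_limit = swap_total + mem_total * overcommit_ratio // 100
print(commit_limit)  # 135953460 kB, matching CommitLimit below
{code}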

 

cat /proc/meminfo
MemTotal: 263518320 kB
MemFree: 34640640 kB
MemAvailable: 247394700 kB
Buffers: 52 kB
Cached: 217424924 kB
SwapCached: 5308 kB
Active: 175441652 kB
Inactive: 46026880 kB
Active(anon): 3637200 kB
Inactive(anon): 540420 kB
Active(file): 171804452 kB
Inactive(file): 45486460 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4194300 kB
SwapFree: 3900668 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 4025572 kB
Mapped: 350944 kB
Shmem: 185972 kB
KReclaimable: 3498500 kB
Slab: 5042144 kB
SReclaimable: 3498500 kB
SUnreclaim: 1543644 kB
KernelStack: 19744 kB
PageTables: 27404 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 135953460 kB
Committed_AS: 6058728 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 82240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 548864 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 8881248 kB
DirectMap2M: 137568256 kB
DirectMap1G: 121634816 kB

 

free
              total        used        free      shared  buff/cache   available
Mem:      263518320    10488492    31968144      185980   221061684   244854292
Swap:       4194300      293632     3900668
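
Note that free shows about 244 GB available, yet under strict overcommit a malloc can still fail once total committed address space reaches CommitLimit (135953460 kB here), since it is commit accounting, not resident memory, that is enforced. If the failures come from the default jemalloc pool reserving large virtual ranges that count against that limit, one untested workaround sketch is to point pyarrow at the plain system allocator before reading:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Untested sketch: route Arrow allocations through plain malloc/free
# instead of the default jemalloc pool, whose virtual-memory
# reservations may count against CommitLimit under overcommit mode 2.
pa.set_memory_pool(pa.system_memory_pool())

table = pq.ParquetDataset([f'{i}.parquet' for i in range(5000)]).read()
{code}

Recent pyarrow builds can also select this allocator via the ARROW_DEFAULT_MEMORY_POOL=system environment variable, set before pyarrow is imported.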

 

 

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: dataset
>             Fix For: 3.0.0
>
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. After updating pyarrow 
> from 0.13.0 to the latest version, 1.0.1, it started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, far more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
>     # create a big dataframe: 50M rows, 8 float and 9 string columns
>     n = 50_000_000
>     df = pd.DataFrame({'A': np.arange(n)})
>     for j in range(1, 9):
>         df[f'F{j}'] = np.random.randn(n) * 100
>     for j in range(9, 12):
>         df[f'F{j}'] = 'ABCDEFGH'
>     for j in range(12, 15):
>         df[f'F{j}'] = 'ABCDEFGH01234'
>     for j in range(15, 18):
>         df[f'F{j}'] = 'ABCDEFGH01234567'
>     # split and save the data to 5000 files of 10000 rows each
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
>     # reading the files one at a time with pandas works
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
> def read_errors():
>     # crashes with a memory error in pyarrow 1.0/1.0.1 (works with 0.13.0);
>     # tried use_legacy_dataset=False as well, same issue
>     fnames = [f'{i}.parquet' for i in range(5000)]
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
> {code}


