[
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254007#comment-17254007
]
Ashish Gupta commented on ARROW-9974:
-------------------------------------
This is a dedicated physical server.
sysctl vm.overcommit_memory
vm.overcommit_memory = 2
cat /proc/meminfo
MemTotal: 263518320 kB
MemFree: 34640640 kB
MemAvailable: 247394700 kB
Buffers: 52 kB
Cached: 217424924 kB
SwapCached: 5308 kB
Active: 175441652 kB
Inactive: 46026880 kB
Active(anon): 3637200 kB
Inactive(anon): 540420 kB
Active(file): 171804452 kB
Inactive(file): 45486460 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4194300 kB
SwapFree: 3900668 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 4025572 kB
Mapped: 350944 kB
Shmem: 185972 kB
KReclaimable: 3498500 kB
Slab: 5042144 kB
SReclaimable: 3498500 kB
SUnreclaim: 1543644 kB
KernelStack: 19744 kB
PageTables: 27404 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 135953460 kB
Committed_AS: 6058728 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 82240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 548864 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 8881248 kB
DirectMap2M: 137568256 kB
DirectMap1G: 121634816 kB
free
              total        used        free      shared  buff/cache   available
Mem:      263518320    10488492    31968144      185980   221061684   244854292
Swap:       4194300      293632     3900668
> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Ashish Gupta
> Assignee: Weston Pace
> Priority: Critical
> Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use
> ParquetDataset(fnames).read() to load all the files. I upgraded pyarrow from
> 0.13.0 to the latest version, 1.0.1, and it has started throwing "OSError: Out
> of memory: malloc of size 131072 failed". The same code on the same machine
> still works with the older version. My machine has 256 GB of memory, far more
> than enough to load the data, which requires < 10 GB. You can use the code
> below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
>
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
>
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
>
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1
>     # (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)