[
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255332#comment-17255332
]
Weston Pace commented on ARROW-9974:
------------------------------------
Ok. I think I've really tracked it down now. The root cause appears to be a
combination of jemalloc, aggressive muzzy decay, and disabled overcommit.
Jemalloc is tracking the issue at
[https://github.com/jemalloc/jemalloc/issues/1328].
Jemalloc ends up creating many small mmaps, and eventually the
vm.max_map_count limit is reached. You can see the value of this limit here:
{code:bash}
[centos@ip-172-30-0-42 ~]$ sysctl vm.max_map_count
vm.max_map_count = 65530
{code}
To confirm this is indeed your issue, you will need to pause the process at
crash time (either using gdb or Python's except/input trick as discussed above)
and count the number of maps:
{code:bash}
[centos@ip-172-30-0-42 ~]$ cat /proc/1829/maps | wc -l
65532
{code}
(Note: this approach gives an approximation of the number of maps rather than
an exact count, but it shouldn't be anywhere close to the limit under normal
operation.)
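If you'd rather check from inside the Python process instead of shelling out to {{wc -l}}, a minimal sketch (Linux-only; it assumes {{/proc}} is available) is:
{code:python}
def count_memory_maps(pid="self"):
    """Count the memory mappings of a process by reading /proc/<pid>/maps.

    Like `wc -l`, this is an approximation of the true map count, but under
    normal operation it should be nowhere near vm.max_map_count.
    """
    with open(f"/proc/{pid}/maps") as f:
        return sum(1 for _ in f)


def max_map_count():
    """Read the kernel's vm.max_map_count limit (same value sysctl reports)."""
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())
{code}
Calling these from the paused process (or periodically while reproducing) lets you watch the map count climb toward the limit.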
---
The preferred workaround, per the jemalloc issue, is to enable overcommit. If
the OOM killer is too unpredictable, you can configure the system to prioritize
killing the process that uses Arrow.
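A small sketch of checking the overcommit policy and volunteering a process to the OOM killer (Linux-only; the {{prefer_oom_kill}} helper name is mine, but {{/proc/sys/vm/overcommit_memory}} and {{/proc/<pid>/oom_score_adj}} are the standard kernel interfaces):
{code:python}
def overcommit_mode():
    """Return the kernel overcommit policy (vm.overcommit_memory):
    0 = heuristic overcommit (default), 1 = always overcommit,
    2 = never overcommit (the problematic setting in this issue)."""
    with open("/proc/sys/vm/overcommit_memory") as f:
        return int(f.read())


def overcommit_disabled(mode):
    # Mode 2 disables overcommit, which triggers the jemalloc map
    # explosion described above.
    return mode == 2


def prefer_oom_kill(pid="self", score=1000):
    """Bias the OOM killer toward this process (1000 = kill first).

    A process may raise its own oom_score_adj without privileges.
    """
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(score))
{code}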
If overcommit must remain disabled for some reason, you could always compile
Arrow without jemalloc.
Finally, the issue linked above suggests some jemalloc configuration,
essentially reducing the rate at which jemalloc returns pages to the OS. That
configuration would not be universally applicable, however, so it doesn't make
sense for Arrow to change these defaults. These settings are also not
configurable at the moment, so this option isn't really possible given the
current code.
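For reference only: a stock jemalloc build reads its tunables from the {{MALLOC_CONF}} environment variable, and {{dirty_decay_ms}}/{{muzzy_decay_ms}} are the standard knobs for how quickly pages are returned to the OS. The sketch below shows what that would look like for a child process linked against stock jemalloc; as noted above, Arrow's bundled jemalloc does not currently honor this, so treat it purely as an illustration (the decay values are illustrative, not recommendations):
{code:python}
import os
import subprocess


def run_with_slow_decay(cmd):
    """Launch `cmd` with jemalloc told to return pages to the OS more
    slowly (10s dirty decay; muzzy decay disabled entirely with -1).

    Only takes effect if the child links against a stock jemalloc that
    reads MALLOC_CONF; Arrow's bundled jemalloc currently ignores it.
    """
    env = dict(os.environ)
    env["MALLOC_CONF"] = "dirty_decay_ms:10000,muzzy_decay_ms:-1"
    return subprocess.run(cmd, env=env)
{code}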
> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Ashish Gupta
> Assignee: Weston Pace
> Priority: Critical
> Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow to the
> latest version, 1.0.1, from 0.13.0, and it has started throwing "OSError: Out
> of memory: malloc of size 131072 failed". The same code on the same machine
> still works with the older version. My machine has 256 GB of memory, way more
> than enough to load the data, which requires < 10 GB. You can use the code
> below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
>
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
>
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
>
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1
>     # (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)