[
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255642#comment-17255642
]
Antoine Pitrou commented on ARROW-9974:
---------------------------------------
> Why this is not an issue on windows 10, read somewhere windows 10 doesn't
> overcommit memory?
IIRC, jemalloc isn't enabled on Windows, so we either use mimalloc or the
system allocator there.
> So if overcommit is enabled vm.max_map_count will not increase that quickly?
If I'm reading correctly, yes.
Note that Arrow can use other memory allocators, in two different ways:
1) you can disable jemalloc at compile-time by passing {{-DARROW_JEMALLOC=OFF}}
to CMake
2) you can pass custom memory pools at runtime: see
[https://arrow.apache.org/docs/cpp/api/memory.html?highlight=mimalloc#memory-pools]
Related: ARROW-11009.
> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Ashish Gupta
> Assignee: Weston Pace
> Priority: Critical
> Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of
> memory: malloc of size 131072 failed". The same code on the same machine
> still works with older version. My machine has 256Gb memory way more than
> enough to load the data which requires < 10Gb. You can use below code to
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
> # create a big dataframe
> df = pd.DataFrame({'A': np.arange(50000000)})
> df['F1'] = np.random.randn(50000000) * 100
> df['F2'] = np.random.randn(50000000) * 100
> df['F3'] = np.random.randn(50000000) * 100
> df['F4'] = np.random.randn(50000000) * 100
> df['F5'] = np.random.randn(50000000) * 100
> df['F6'] = np.random.randn(50000000) * 100
> df['F7'] = np.random.randn(50000000) * 100
> df['F8'] = np.random.randn(50000000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
> # below code works to read
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> def read_errors():
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)