[
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254642#comment-17254642
]
Ashish Gupta commented on ARROW-9974:
-------------------------------------
If the system memory limit were the issue, would it have worked on the same
machine with the older version of pyarrow? My code was working perfectly fine
with 0.13.0. Regarding the tests you asked me to run:
1) Confirm how much RAM is actually in use by python / pyarrow
read_errors crashes with a core dump, so I am not able to use a try/except block.
{code:bash}
python test.py
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped){code}
However, I observed in top that memory usage was around 5 GB when it crashed
(see the monitoring sketch below, after point 2).
2) ./allocator
Allocated 110641 megabytes before failing
With read_works I checked that the maximum memory required for my example is
about 15 GB. So given that roughly 110 GB can be allocated and nothing else is
running on the machine, the failure doesn't make sense.
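Going back to point 1: since the abort happens inside the C++ layer, the only way I can see the memory usage is from outside the process. Something along the lines of the sketch below could record the peak RSS of the crashing run from a parent process instead of eyeballing top (a rough Linux-only sketch; test.py is the repro script from above, and peak_rss_mb is just an illustrative helper name):
{code:python}
# Rough sketch: run the failing repro in a child process and poll its RSS via
# /proc, because the std::bad_alloc abort kills the child before any Python
# try/except or final print could run inside it.
import subprocess
import time

def peak_rss_mb(cmd):
    proc = subprocess.Popen(cmd)
    peak = 0
    while proc.poll() is None:
        try:
            with open(f"/proc/{proc.pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        peak = max(peak, int(line.split()[1]) // 1024)  # kB -> MB
        except FileNotFoundError:
            break  # child exited between poll() and open()
        time.sleep(0.5)
    proc.wait()
    print(f"exit code {proc.returncode}, peak RSS ~{peak} MB")

if __name__ == "__main__":
    peak_rss_mb(["python", "test.py"])
{code}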
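And as a pyarrow-level counterpart to the ./allocator check, a probe along these lines (a hypothetical sketch, not the ./allocator binary you sent) would show how much pyarrow's own memory pool can hand out before failing; probe_pyarrow_pool and the 1 MB chunk size are just illustrative choices:
{code:python}
# Hypothetical analogue of ./allocator that goes through pyarrow's default
# memory pool: keep allocating 1 MB buffers until allocation fails, then
# report how many bytes the pool handed out.
import pyarrow as pa

def probe_pyarrow_pool(chunk_mb=1):
    buffers = []
    try:
        while True:
            buffers.append(pa.allocate_buffer(chunk_mb * 1024 * 1024))
    except (pa.ArrowMemoryError, MemoryError):
        pass
    allocated_mb = pa.total_allocated_bytes() // (1024 * 1024)
    print(f"pyarrow pool allocated ~{allocated_mb} MB before failing")
    buffers.clear()  # drop references so the pool can release the memory

if __name__ == "__main__":
    probe_pyarrow_pool()
{code}
Like ./allocator, this intentionally exhausts memory, so it is best run when nothing else important is on the box.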
> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Ashish Gupta
> Assignee: Weston Pace
> Priority: Critical
> Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of
> memory: malloc of size 131072 failed". The same code on the same machine still
> works with the older version. My machine has 256 GB of memory, way more than
> enough to load the data, which requires < 10 GB. You can use the code below to
> reproduce the issue on your side.
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
>
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
>
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
>
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)