[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253825#comment-17253825
 ] 

Weston Pace commented on ARROW-9974:
------------------------------------

I was able to reproduce memory issues at 65000000 rows.  However, this is 
always about the point where I would expect to see issues (since I was nearing 
the 8GB limit).  I did not see the errors you were seeing because the system 
was overcomitting and triggering the OOM killer instead of hard stopping the 
allocation.  Once I changed the overcommit mode to 2 (no overcommit) then I 
started getting malloc/realloc errors.  I'm wondering if this may be related to 
the issues you are seeing.  Can you share with me some more information?

Is this running on a VM or a dedicated server?

What is the output of `sysctl vm.overcommit_memory`?

What is the output of `cat /proc/meminfo`?

What is the output of `free`?

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: dataset
>             Fix For: 3.0.0
>
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to 
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with older version. My machine has 256Gb memory way more than 
> enough to load the data which requires < 10Gb. You can use below code to 
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to