[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254171#comment-17254171
 ] 

Weston Pace commented on ARROW-9974:
------------------------------------

I believe what is happening is that the `ParquetDataset` approach is using more 
memory.  That is because the pyarrow.Table in-memory representation is larger 
than the pandas dataframe in-memory representation (in this case), specifically 
for strings.  Arrow stores each string as its raw bytes plus a 4-byte offset, 
so each instance of 'ABCDEFGH' occupies 12 bytes of RAM.  Pandas, on the other 
hand, stores an 8-byte pointer to an 8-byte string.  If all the strings were 
different this would mean 16 bytes per string, but since they are all the same, 
the string object is shared, so it is more or less 8 bytes per row.
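
As a back-of-the-envelope check of that arithmetic (a sketch; the real Arrow total also includes the extra offset entry, validity bitmaps, and allocator padding):

```python
N = 50_000_000                        # rows in the reproduction script

# Arrow: raw UTF-8 bytes plus a 4-byte offset per value
arrow_per_cell = len('ABCDEFGH') + 4  # 12 bytes

# pandas object column: an 8-byte pointer per row; the single shared
# 'ABCDEFGH' object itself only exists once
pandas_per_cell = 8

print(f"arrow:  {arrow_per_cell * N / 2**30:.2f} GiB for one such column")
print(f"pandas: {pandas_per_cell * N / 2**30:.2f} GiB for one such column")
```

For a single 'ABCDEFGH' column that works out to roughly 0.56 GiB in arrow versus 0.37 GiB in pandas, and the gap widens for the longer string columns.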

So for the smaller table, arrow is using ~5.8GB while pandas is using ~4.4GB.  
This explains why it fails in 1.0.1 but does not explain...

> "256Gb memory way more than enough to load the data which requires < 10Gb"

For this I think the problem is simply that your system is not allowing each 
process to use 256GB.  With "vm.overcommit_memory = 2" the OS avoids 
overcommitting entirely.  In addition, a large portion of RAM (~120 GB) is 
reserved for the kernel (this is tunable, and you may want to tune it, since 
that is a rather large amount to reserve for the kernel).  The remaining 
135953460 kB (seen as CommitLimit in /proc/meminfo) is shared across all 
processes.  Since overcommitting is disabled, this limit tracks the reserved 
(not just used) RAM of all processes.
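
The same accounting can be read back on a live system from /proc/meminfo (a quick sketch, Linux only):

```python
# Parse the commit accounting discussed above out of /proc/meminfo.
# With vm.overcommit_memory = 2, a malloc fails once Committed_AS would
# exceed CommitLimit, no matter how much physical RAM is actually free.
meminfo = {}
with open('/proc/meminfo') as f:
    for line in f:
        key, rest = line.split(':', 1)
        meminfo[key] = int(rest.split()[0])   # values are reported in kB

print('CommitLimit: ', meminfo['CommitLimit'], 'kB')
print('Committed_AS:', meminfo['Committed_AS'], 'kB')
```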

To confirm all of this I suggest two tests.

1) Confirm how much RAM is actually in use by python / pyarrow

Change the last lines of your script to...

 
{code:python}
import pyarrow as pa

try:
    read_errors()
except:
    # capture the peak Arrow allocation at the moment of the crash
    max_bytes = pa.default_memory_pool().max_memory()
    input("Press Enter to continue...")
    print(f'Arrow bytes in use: {max_bytes}')
    raise
{code}
This will pause the program at the crash and allow you to inspect the memory.  
You can do this by looking up the process...

 

 
{code:bash}
(base) [centos@ip-172-30-0-34 ~]$ ps -eaf | grep -i python
root         880       1  0 Dec18 ?        00:00:41 
/usr/libexec/platform-python -Es /usr/sbin/tuned -l -P
centos    228668  225417 90 17:45 pts/0    00:00:16 python experiment.py
centos    228680  228186  0 17:45 pts/1    00:00:00 grep --color=auto -i python
{code}
...and then look up the RAM usage of the process.
{code:bash}
(base) [centos@ip-172-30-0-34 ~]$ cat /proc/228668/status | grep -i vmdata
VmData:  3445600 kB
{code}
In addition, the experiment will print how many bytes arrow was using (to help 
distinguish from RAM used by the python heap and RAM reserved but not in use)...
{code:none}
Press Enter to continue...
Arrow bytes in use: 2601828992
Traceback (most recent call last):
{code}
So, even though my system has 8GB of RAM, 4GB is reserved for the kernel, 
~0.5GB is in use by other processes, and ~1GB is in use by python, so only 
~2.6GB remain for arrow.

2) Confirm how much RAM your system is currently allowing to be allocated.

Compile and run the following simple program...
{code:c}
#include <stdio.h>
#include <stdlib.h>

int MBYTE = 1024 * 1024;

int main(void) {
  int mbytes_allocated = 0;
  /* room to remember up to MBYTE allocations */
  char **pointers = malloc(MBYTE * sizeof(char *));
  while (mbytes_allocated < MBYTE) {
    char *pointer = malloc(MBYTE);
    if (pointer == NULL) {
      break;
    }
    pointers[mbytes_allocated] = pointer;
    mbytes_allocated++;
  }
  printf("Allocated %d megabytes before failing\n", mbytes_allocated);
  for (int i = 0; i < mbytes_allocated; i++) {
    free(pointers[i]);
  }
  free(pointers);
  return 0;
}
{code}
{code:bash}
(base) [centos@ip-172-30-0-34 ~]$ gcc -o allocator allocator.c
(base) [centos@ip-172-30-0-34 ~]$ ./allocator
Allocated 3395 megabytes before failing
{code}
This matches pretty closely with what we were seeing in python.

As for a fix, in the new datasets API (available starting in 1.1 but more 
fully in 2.0) you can use 
[scan|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scan]
 which will let you convert to pandas incrementally and should have RAM usage 
similar to the first approach you had.  You may also want to tune your OS, 
since you are reserving quite a bit of RAM for the kernel and that might be too 
much.  vm.overcommit_ratio defaults to 50 (i.e. CommitLimit = swap + 50% of 
RAM), which is often too conservative for systems with large amounts of RAM.

 

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: dataset
>             Fix For: 3.0.0
>
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to 
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with older version. My machine has 256Gb memory way more than 
> enough to load the data which requires < 10Gb. You can use below code to 
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
