[ https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204167#comment-17204167 ]

Ashish Gupta commented on ARROW-9974:
-------------------------------------

I have test.py as below
{code:python}
import pyarrow.parquet as pq

fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')

df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
{code}

Running test.py with use_legacy_dataset=True, there is no core dump, just the error below:

{code}
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 1271, in read
    table = piece.read(columns=columns, use_threads=use_threads,
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 718, in read
    table = reader.read(**options)
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 326, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1125, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Out of memory: realloc of size 65600 failed
cannot allocate memory for thread-local data: ABORT
{code}
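
Since the machine has plenty of free RAM, the "cannot allocate memory for thread-local data" message makes me suspect a per-process limit rather than exhausted physical memory. A small diagnostic sketch (my own addition, not from the report) to print the relevant rlimits before running test.py:

{code:python}
# Diagnostic sketch (my addition): print per-process limits that could explain
# an allocation failure on a machine with 256 GB of RAM. All of these are
# standard attributes of the stdlib resource module on Linux.
import resource

for name in ('RLIMIT_AS', 'RLIMIT_DATA', 'RLIMIT_NPROC', 'RLIMIT_NOFILE'):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f'{name}: soft={soft} hard={hard}')
{code}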

With use_legacy_dataset=False there is a core dump, but I think there is some issue reading the core dump file (gdb reports it as truncated):

{code}
Reading symbols from /data/install/anaconda3/bin/python...done.
BFD: warning: 
/data/dump/temp/core.python.862529738.cd627b19559c40969f42d3bb01c5e03d.739784.1601400696000000
 is truncated: expected core file size >= 4752101376, found: 2147483648
warning: core file may not match specified executable file.
[New LWP 739784]
[New LWP 739791]
[New LWP 739795]
[New LWP 739788]
[New LWP 739785]
[New LWP 739792]
[New LWP 739790]
[New LWP 739794]
[New LWP 739793]
[New LWP 739789]
Cannot access memory at address 0x7f86e6283128
Cannot access memory at address 0x7f86e6283120
Failed to read a valid object file image from memory.
Core was generated by `/data/install/anaconda3/bin/python test.py'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f86e5aae70f in ?? ()
[Current thread is 1 (LWP 739784)]
(gdb) bt
#0 0x00007f86e5aae70f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ffcf1fb8c80
{code}
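
The core file is cut off at exactly 2147483648 bytes (2 GiB), which looks to me like a core-size limit rather than a gdb problem; that is an assumption on my part. A rough sketch for re-running with the soft limit raised in the crashing process (equivalent to `ulimit -c unlimited` when the hard limit allows it):

{code:python}
# Sketch (assumption: the 2 GiB truncation comes from RLIMIT_CORE, not gdb).
# Raise the soft core-size limit to the hard limit before triggering the
# crash so the kernel can write a complete dump.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print('core limit before:', soft, hard)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

import pyarrow.parquet as pq

fnames = [f'{i}.parquet' for i in range(5000)]
pq.ParquetDataset(fnames, use_legacy_dataset=False).read(use_threads=False)
{code}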

Would you be able to run the example on a machine with Linux CentOS 8?

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Priority: Critical
>              Labels: dataset
>             Fix For: 2.0.0
>
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1, and it has started throwing "OSError: 
> Out of memory: malloc of size 131072 failed". The same code on the same 
> machine still works with the older version. My machine has 256 GB of memory, 
> way more than enough to load the data, which requires < 10 GB. You can use 
> the code below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
> def read_errors():
>     # below code crashes with a memory error in pyarrow 1.0/1.0.1
>     # (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
> {code}


