[
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204167#comment-17204167
]
Ashish Gupta commented on ARROW-9974:
-------------------------------------
I have test.py as below:
{code:python}
import pyarrow.parquet as pq

fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')
len(fnames)

df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
{code}
With use_legacy_dataset=True there is no core dump, just the error below:
{code}
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 1271, in read
    table = piece.read(columns=columns, use_threads=use_threads,
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 718, in read
    table = reader.read(**options)
  File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 326, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1125, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Out of memory: realloc of size 65600 failed
cannot allocate memory for thread-local data: ABORT{code}
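Note that the failing allocation is tiny (65600 bytes) and the message mentions thread-local data, so this may be an address-space or allocator-arena limit rather than real RAM exhaustion. Below is a small diagnostic sketch I would run just before the failing read; the pool/limit checks are only guesses about where to look, not a confirmed cause.
{code:python}
# Diagnostic sketch (assumption: the failure is an address-space/arena limit,
# not physical memory exhaustion). Run just before the failing read() call.
import resource
import pyarrow as pa

# Arrow's allocator backend and how much it currently holds.
pool = pa.default_memory_pool()
print("arrow backend:", getattr(pool, "backend_name", "unknown"))
print("arrow allocated bytes:", pa.total_allocated_bytes())

# Process-level virtual memory cap; RLIM_INFINITY means no explicit limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS soft/hard:", soft, hard)

# Number of memory mappings; a very large count can hit vm.max_map_count.
with open("/proc/self/maps") as maps:
    print("memory mappings:", sum(1 for _ in maps))
{code}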
With use_legacy_dataset=False there is a core dump, but the core file appears to be
truncated, so gdb cannot read it properly:
{code}
Reading symbols from /data/install/anaconda3/bin/python...done.
BFD: warning: /data/dump/temp/core.python.862529738.cd627b19559c40969f42d3bb01c5e03d.739784.1601400696000000 is truncated: expected core file size >= 4752101376, found: 2147483648
warning: core file may not match specified executable file.
[New LWP 739784]
[New LWP 739791]
[New LWP 739795]
[New LWP 739788]
[New LWP 739785]
[New LWP 739792]
[New LWP 739790]
[New LWP 739794]
[New LWP 739793]
[New LWP 739789]
Cannot access memory at address 0x7f86e6283128
Cannot access memory at address 0x7f86e6283120
Failed to read a valid object file image from memory.
Core was generated by `/data/install/anaconda3/bin/python test.py'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f86e5aae70f in ?? ()
[Current thread is 1 (LWP 739784)]
(gdb) bt
#0 0x00007f86e5aae70f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ffcf1fb8c80
{code}
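The truncation at exactly 2147483648 bytes suggests the core size (or file size) limit was capped at 2 GiB when the process crashed. This is only my assumption about the cause of the truncation, but if it is right, a sketch like the one below, run at the top of the repro script, should produce a complete dump next time:
{code:python}
# Sketch: raise the core-dump size limit before running the failing read,
# assuming the 2 GiB truncation comes from RLIMIT_CORE (or RLIMIT_FSIZE).
import resource

for limit in (resource.RLIMIT_CORE, resource.RLIMIT_FSIZE):
    soft, hard = resource.getrlimit(limit)
    # The soft limit can be raised up to the hard limit without privileges.
    resource.setrlimit(limit, (hard, hard))

# ... then run the pq.ParquetDataset(...).read(...) call that aborts.
{code}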
Would you be able to run the example on a machine running CentOS 8 Linux?
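In case it helps with triage, I also plan to cross-check by reading the same files one by one with read_table and concatenating the results, bypassing ParquetDataset entirely. This is only a sketch of that idea; if it succeeds, it would suggest the failure comes from per-file state held by ParquetDataset rather than from the data volume.
{code:python}
# Cross-check sketch: read the same 5000 files individually and concatenate,
# without going through ParquetDataset.
import pyarrow as pa
import pyarrow.parquet as pq

fnames = [f'{i}.parquet' for i in range(5000)]
tables = [pq.read_table(name, use_threads=False) for name in fnames]
table = pa.concat_tables(tables)
print(table.num_rows)
{code}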
> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while
> reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Ashish Gupta
> Priority: Critical
> Labels: dataset
> Fix For: 2.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use
> ParquetDataset(fnames).read() to load all files. I updated pyarrow from 0.13.0
> to the latest version, 1.0.1, and it has started throwing "OSError: Out of
> memory: malloc of size 131072 failed". The same code still works on the same
> machine with the older version. My machine has 256 GB of memory, far more than
> enough to load the data, which requires < 10 GB. You can use the code below to
> reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
>
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
>
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
>
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>
> {code}