ondrej metelka created ARROW-16028:
--------------------------------------
Summary: Memory leak in `fragment.to_table`
Key: ARROW-16028
URL: https://issues.apache.org/jira/browse/ARROW-16028
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, Python
Affects Versions: 6.0.1
Reporter: ondrej metelka
This "pseudo" code ends with OOM.
{code:python}
import fsspec
import pyarrow
import pyarrow.parquet as pq

# `our_storage_options`, `some_filters`, and `columns_to_read` are
# placeholders for our actual configuration.
fs = fsspec.filesystem(
    "s3",
    default_cache_type="none",
    default_fill_cache=False,
    **our_storage_options,
)

dataset = pq.ParquetDataset(
    "path in bucket",
    filesystem=fs,
    filters=some_filters,
    use_legacy_dataset=False,
)

# this ends with OOM
dataset.read(columns=columns_to_read)

# and this too
tables = []
for fragment in dataset.fragments:
    tables.append(fragment.to_table(columns=columns_to_read))
all_data = pyarrow.concat_tables(tables)
{code}
What is really weird: if we put a breakpoint in the loop and *load* just *one fragment*, the fragment loads fine, but something *keeps eating memory after the load* until there is none left.
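
For reference, a minimal sketch of how we observe the growth, assuming the same `dataset` and `columns_to_read` placeholders as above; `pyarrow.total_allocated_bytes()` reports the bytes currently held by the default Arrow memory pool, so it should drop once a fragment's table is released:

{code:python}
import pyarrow as pa

# Sketch only: `dataset` and `columns_to_read` are the placeholders
# from the snippet above.
for i, fragment in enumerate(dataset.fragments):
    table = fragment.to_table(columns=columns_to_read)
    mib = pa.total_allocated_bytes() / 2**20
    print(f"fragment {i}: {table.num_rows} rows, pool holds {mib:.1f} MiB")
    del table  # drop the only reference to the table
    mib = pa.total_allocated_bytes() / 2**20
    print(f"after del: pool holds {mib:.1f} MiB")  # should fall if memory is freed
{code}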
We are trying to read a Parquet table that consists of several files under the desired partitions. Each fragment has tens of columns and tens of millions of rows.
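
A possible mitigation sketch (not a fix for the leak itself): stream each fragment as record batches instead of materializing it as a whole table. `Fragment.to_batches()` accepts the same `columns` argument; `process` below is a hypothetical per-batch consumer.

{code:python}
# Sketch under the same placeholder names as above.
for fragment in dataset.fragments:
    for batch in fragment.to_batches(columns=columns_to_read):
        # `process` is a hypothetical consumer that handles one
        # RecordBatch at a time, so only one batch is held in memory.
        process(batch)
{code}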