ondrej metelka created ARROW-16028:
--------------------------------------

             Summary: Memory leak in `fragment.to_table`
                 Key: ARROW-16028
                 URL: https://issues.apache.org/jira/browse/ARROW-16028
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 6.0.1
            Reporter: ondrej metelka


This "pseudo" code ends with OOM.

 
{code:python}
import fsspec
import pyarrow
import pyarrow.parquet as pq

# our_storage_options, some_filters, and columns_to_read are placeholders
fs = fsspec.filesystem(
    "s3",
    default_cache_type="none",
    default_fill_cache=False,
    **our_storage_options,
)
dataset = pq.ParquetDataset(
    "path in bucket",
    filesystem=fs,
    filters=some_filters,
    use_legacy_dataset=False,
)

# this ends with OOM
dataset.read(columns=columns_to_read)

# and this too
tables = []
for fragment in dataset.fragments:
    tables.append(fragment.to_table(columns=columns_to_read))
all_data = pyarrow.concat_tables(tables)
{code}
What is really weird is that if we set a breakpoint in the loop and *load* just *one fragment*, it loads, but then something *keeps eating memory after the load* until there is none left.
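
A minimal check, assuming `fragment` and `columns_to_read` from the snippet above, to see whether the retained memory is at least visible to Arrow's default memory pool:

{code:python}
import pyarrow

# Snapshot Arrow's pool before and after loading a single fragment.
before = pyarrow.total_allocated_bytes()
table = fragment.to_table(columns=columns_to_read)
after = pyarrow.total_allocated_bytes()
print(f"table.nbytes={table.nbytes}, pool delta={after - before}")

del table
# If the pool does not shrink back here, the memory is retained by Arrow;
# if it does, the growth is likely outside the Arrow pool (e.g. fsspec caching).
print(f"pool after del={pyarrow.total_allocated_bytes()}")
{code}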

We are trying to read a Parquet table that consists of several files under the desired partitions. Each fragment has tens of columns and tens of millions of rows.
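
A possible workaround sketch, not a confirmed fix: stream record batches through the `pyarrow.dataset` API instead of materializing whole fragments, so only one batch is resident at a time. `process` is a hypothetical per-batch consumer, and the path and `fs` are the same placeholders as above:

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("path in bucket", filesystem=fs, format="parquet")

# Stream fixed-size record batches instead of loading whole fragments;
# filters, if needed, can be passed as a ds.Expression via filter=.
for batch in dataset.to_batches(columns=columns_to_read, batch_size=65536):
    process(batch)  # placeholder for per-batch work
{code}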

 


