Ark-kun opened a new issue, #45971:
URL: https://github.com/apache/arrow/issues/45971
### Describe the bug, including details regarding any error messages, version, and platform.
We fetch data batches from BigQuery and write them to Parquet. The Parquet
writer consumes all available memory and crashes the pod.
This does not look like allocator behavior (jemalloc or similar), since we do
not see the issue when we periodically create new ParquetWriter instances.
This code leaks:
```py
from google.cloud import bigquery
from google.cloud.bigquery import _pandas_helpers
from pyarrow import parquet

client = bigquery.Client(project=...)
job = client.get_job(job_id=...)
result = job.result()
arrow_schema = _pandas_helpers.bq_to_arrow_schema(result.schema)
bqstorage_client = client._ensure_bqstorage_client()

# A single writer stays open for the whole result set; memory grows with every batch.
with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
    for batch in result.to_arrow_iterable(
        bqstorage_client=bqstorage_client,
        max_queue_size=1,
        max_stream_count=1,
    ):
        writer.write_batch(batch)
```

Initially we thought that the bug was in BigQuery, but we were wrong.
https://github.com/googleapis/python-bigquery/issues/2151
Proof:
Changing from
```
with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
    for batch in result.to_arrow_iterable(...):
        writer.write_batch(batch)
```
to
```
for batch in result.to_arrow_iterable(...):
    with parquet.ParquetWriter(where="result.parquet", schema=arrow_schema) as writer:
        writer.write_batch(batch)
```
fixes the memory leak.
Versions: "pyarrow==19.0.1", "pyarrow==16.1.0"
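
For reference, here is a minimal sketch of the "periodically create new ParquetWriter instances" workaround described above. It is an illustration, not the code we run. The helper names `write_in_parts` and `parquet_part_opener`, the `batches_per_file` parameter, and the `result-00000.parquet` naming scheme are all hypothetical. Unlike the proof snippet, which reopens the same path for every batch, this variant writes each group of batches to its own numbered part file.

```python
from itertools import islice
from typing import Any, Callable, ContextManager, Iterable

_DONE = object()  # sentinel marking an exhausted batch iterator


def write_in_parts(
    batches: Iterable[Any],
    open_writer: Callable[[int], ContextManager[Any]],
    batches_per_file: int,
) -> int:
    """Stream batches into a fresh writer every `batches_per_file` batches.

    `open_writer(part_index)` must return a context manager whose value has a
    `write_batch` method (for example a pyarrow ParquetWriter).
    Returns the number of parts that were opened.
    """
    if batches_per_file < 1:
        raise ValueError("batches_per_file must be >= 1")
    iterator = iter(batches)
    part_index = 0
    while True:
        first = next(iterator, _DONE)
        if first is _DONE:
            return part_index
        # Closing the writer at the end of each part releases whatever it holds.
        with open_writer(part_index) as writer:
            writer.write_batch(first)
            for batch in islice(iterator, batches_per_file - 1):
                writer.write_batch(batch)
        part_index += 1


def parquet_part_opener(schema: Any, prefix: str = "result") -> Callable[[int], Any]:
    """Hypothetical factory: one ParquetWriter per numbered part file."""
    from pyarrow import parquet  # imported lazily so the helper above has no hard dependency

    def open_writer(part_index: int) -> Any:
        return parquet.ParquetWriter(
            where=f"{prefix}-{part_index:05d}.parquet", schema=schema
        )

    return open_writer
```

With the names from the snippet above, this would be called as `write_in_parts(result.to_arrow_iterable(...), parquet_part_opener(arrow_schema), batches_per_file=100)`; the value 100 is arbitrary.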
### Component(s)
Parquet