thomas-pfeiffer opened a new issue, #3168:
URL: https://github.com/apache/iceberg-python/issues/3168
### Apache Iceberg version
0.11.0 (latest release)
### Please describe the bug 🐞
We have a table with a column containing large JSON strings (think of large
`pydantic` validation results). The resulting datafile in Iceberg is a roughly
6 MB (GZip-compressed) Parquet file, but querying it drives memory consumption
to ~4 GB. (Just for a few seconds, but long enough to cause out-of-memory
issues on some systems.)
The root cause is that by default `pyarrow` materializes the strings row by
row in memory, which blows up the memory footprint. The behaviour can be
reproduced by downloading the datafile and opening it directly with `pyarrow`.
There are two workarounds in `pyarrow`: 1. don't load the problematic column
(if your use case allows it), and 2. switch to dictionary encoding for said
column (example snippet below).
```py
from pyarrow.parquet import read_table

table = read_table(
    "datafile.parquet",
    read_dictionary=["problematic_column"],
)
```
Issue in `pyiceberg`:
Regardless of whether you use `table.scan(...).to_arrow()` or
`table.scan(...).to_arrow_batch_reader()`, `pyiceberg` currently has, as far
as I know, no option to specify dictionary encoding for certain columns, so
`pyarrow` uses the default encoding and memory usage explodes.
`to_arrow_batch_reader` does not help here either, because, as far as I
understand, each batch in the `pyiceberg` batch reader corresponds to an
individual datafile. Hence, with one problematic 6 MB datafile, it makes no
difference whether you use the batch reader or not. I also have the impression
that by the time you iterate over the reader, `pyarrow` has already loaded the
Parquet file in a separate thread, and that is where the memory explosion
actually happens.
So currently the only workaround in `pyiceberg` is option 1: don't load the
problematic column, by specifying `selected_fields`:
```py
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("your_table")
reader = table.scan(
    selected_fields=("all", "other", "columns"),
).to_arrow_batch_reader()
for batch in reader:
    ...
```
Expected behaviour:
There should be an option somewhere, e.g. on the table scan, to specify which
columns should be read with dictionary encoding. This option should be
forwarded to `pyarrow` internally so that `pyarrow` uses less memory.
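One possible shape for such an option (the `read_dictionary` parameter below is hypothetical and does not exist in `pyiceberg` today; the name merely mirrors the `pyarrow` parameter it would be forwarded to):

```py
# Hypothetical API sketch -- `read_dictionary` is NOT an existing
# pyiceberg parameter; it would be passed through to pyarrow internally.
reader = table.scan(
    selected_fields=("*",),
).to_arrow_batch_reader(read_dictionary=["problematic_column"])
```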
Remark:
I would not change the default behaviour. It would just be good to have the
option to configure the encoding in `pyarrow` when needed.
This issue is a follow-up to
https://github.com/apache/iceberg-python/issues/1205
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]