Rohan Garg created PARQUET-2141:
-----------------------------------
Summary: Controlling memory utilization by ParquetReader
Key: PARQUET-2141
URL: https://issues.apache.org/jira/browse/PARQUET-2141
Project: Parquet
Issue Type: Improvement
Reporter: Rohan Garg
In Apache Druid, Parquet is one of the popular input source formats for ingesting
data into a Druid cluster
(https://druid.apache.org/docs/latest/development/extensions-core/parquet.html).
We rely on the parquet-mr library to read the Parquet files and then convert
them row-by-row into Druid's native format for ingestion. A considerable number of
our use cases ingest whole Parquet files (i.e., all columns in a single shot)
into the system.
A challenge we face is that the Parquet reader loads an entire row
group into memory as part of its normal operation. Row groups can be quite
large (e.g., 1GB), and this sometimes puts pressure on our reader JVM,
leading to OOMs. In other cases it creates GC pressure
on the JVM, reducing the throughput of the ingestion tasks.
To mitigate this problem, we are considering whether it would be better to have an
option to download the Parquet row group/file first and memory-map it for
reading. The code which buffers the row group already works against the ByteBuffer
interface
(https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1763),
so it seems like it could complement the MappedByteBuffer implementation too.
Such a change would alleviate pressure on our reader JVM, thereby greatly
reducing the chances of OOMs.
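For illustration, here is a minimal sketch of what memory-mapping a row group's byte range with java.nio could look like. This is not parquet-mr code; the class name, method name, and the fixed offsets are assumptions made up for the example. The point is only that FileChannel.map returns a MappedByteBuffer, which satisfies the same ByteBuffer interface the buffering code already uses, while leaving page residency to the OS instead of the JVM heap:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RowGroupMmapSketch {
    // Hypothetical helper: maps the byte range [offset, offset + length)
    // of a local file instead of reading it into heap memory.
    static MappedByteBuffer mapRange(Path file, long offset, long length)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // READ_ONLY mapping: pages are faulted in lazily by the OS and can
            // be evicted under memory pressure, avoiding heap/GC cost. The
            // mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded row group: a small temp file.
        Path tmp = Files.createTempFile("rowgroup", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        MappedByteBuffer buf = mapRange(tmp, 0, 4);
        System.out.println(buf.get(0) + " " + buf.remaining());
        Files.delete(tmp);
    }
}
```

One caveat with this approach: mappings are only unmapped when the buffer is garbage-collected, so a reader that maps many row groups would need to manage mapping lifetimes carefully.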
We're very open to more ideas or already tried solutions around this problem.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)