[
https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454354#comment-17454354
]
Weston Pace edited comment on ARROW-14987 at 12/7/21, 3:42 AM:
---------------------------------------------------------------
*TL;DR: A chunk_size of 3 is way too low.*
Thank you so much for the detailed reproduction.
h3. Some notes
First, I used 5 times the amount of data that you were working with. This
works out to 12.5MB of int64_t "data".
Second, you are not releasing the variable named "table" in your main method.
This holds on to 12.5MB of RAM. I added table.reset() before the sleep to take
care of this.
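For reference, the relevant change in your main() (the "start"/"end" prints
omitted for brevity):
{code:c++}
int main(int argc, char** argv) {
  std::shared_ptr<arrow::Table> table = generate_table();
  write_parquet_file(*table);
  read_whole_file();
  table.reset();  // drop the last reference so the ~12.5MB of table data is freed
  sleep(100);
}
{code}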
Third, a chunk size of 3 is pathologically small. This means parquet is going
to have to write row group metadata after every 3 rows of data. As a result,
the parquet file, which only contains 12.5MB of real data, requires 169MB.
This means there is ~157MB of metadata. A chunk size should, at a minimum, be
in the tens of thousands, and often is in the millions.
*When I run this test I end up with nearly 1GB of memory usage! Even given the
erroneously large parquet file, this seems like way too much.*
h3. Figuring out Arrow memory pool usage
One helpful tool when determining how much RAM Arrow is using is to print out
how many bytes Arrow thinks it is holding onto. To do this you can add...
{noformat}
std::cout << arrow::default_memory_pool()->bytes_allocated() << " bytes allocated" << std::endl;
{noformat}
Assuming you add the "table.reset()" call, this should print "0 bytes allocated",
which means that Arrow is not holding on to any memory.
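As an aside (this is not in your snippet, just a suggestion), the memory pool
also tracks its high-water mark via max_memory(), which helps distinguish
memory Arrow is still holding from memory it used at its peak:
{noformat}
std::cout << arrow::default_memory_pool()->max_memory() << " peak bytes allocated" << std::endl;
{noformat}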
The second common thing to get blamed is jemalloc. Arrow uses jemalloc (or
possibly mimalloc) internally in its memory pools and these allocators
sometimes over-allocate and sometimes hold onto memory for a little while.
However, this seems unlikely because jemalloc is configured by default by Arrow
to release over-allocated memory every 1 second.
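(As an aside, and assuming a build that bundles jemalloc, this decay interval
is also adjustable from user code via arrow::jemalloc_set_decay_ms() from
arrow/memory_pool.h; a minimal sketch:)
{code:c++}
#include <arrow/memory_pool.h>

// Assumption: this only has an effect when Arrow is built with jemalloc; it
// returns a non-OK Status otherwise. 1000ms matches the default behavior
// described above.
arrow::Status st = arrow::jemalloc_set_decay_ms(1000);
{code}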
To verify I built an instrumented version of Arrow to print stats for its
internal jemalloc pool after 5 seconds of being idle. I got:
{noformat}
Allocated: 29000, active: 45056, metadata: 6581448 (n_thp 0), resident: 6606848, mapped: 12627968, retained: 125259776
{noformat}
This means Arrow has 29KB of data actively allocated (this is curious, given
bytes_allocated is 0, and worth investigation at a later date, but certainly
not the culprit here).
That 29KB of active data spans 45.056KB of pages (this is what people refer to
when they talk about fragmentation). There is also 6.58MB of jemalloc
metadata. I'm pretty sure this is rather independent of the workload and not
something to worry too much about.
Combined, this 45.056KB of data and 6.58MB of metadata occupy 6.61MB of RSS.
So far so good.
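(For anyone who wants to reproduce these numbers: below is a sketch of how they
can be read from a jemalloc instance via mallctl(). This is my own illustration
rather than the exact instrumentation patch, and note that Arrow's bundled
jemalloc is built with prefixed symbols, so inside Arrow the call is spelled
slightly differently.)
{code:c++}
#include <jemalloc/jemalloc.h>

#include <cstddef>
#include <cstdint>
#include <iostream>

// Print the same counters shown above for the jemalloc instance this code is
// linked against.
void PrintJemallocStats() {
  // jemalloc caches its stats; bumping the epoch refreshes them.
  uint64_t epoch = 1;
  size_t epoch_len = sizeof(epoch);
  mallctl("epoch", &epoch, &epoch_len, &epoch, sizeof(epoch));

  const char* names[] = {"stats.allocated", "stats.active", "stats.metadata",
                         "stats.resident", "stats.mapped", "stats.retained"};
  for (const char* name : names) {
    size_t value = 0;
    size_t value_len = sizeof(value);
    if (mallctl(name, &value, &value_len, nullptr, 0) == 0) {
      std::cout << name << ": " << value << std::endl;
    }
  }
}
{code}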
h3. Figuring out the rest of the memory usage
There is only one other place the remaining memory usage can be, which is the
application's global system allocator. To debug this further I built my test
application with jemalloc (a separate jemalloc instance from the one Arrow uses
internally). This means Arrow's memory pool will use one instance of jemalloc and
everything else will use my own instance of jemalloc. Printing stats I get:
{noformat}
Allocated: 257904, active: 569344, metadata: 15162288 (n_thp 0), resident: 950906880, mapped: 958836736, retained: 648630272
{noformat}
Now we have found our culprit. There is about 258KB allocated, occupying 569KB
worth of pages, plus 15MB of jemalloc metadata. This is pretty reasonable and
makes sense (this is memory used by shared pointers and various metadata
objects).
_However, this ~15MB of data is occupying nearly 1GB of RSS!_
To debug further I used jemalloc's memory profiling to track where all of these
allocations were happening. It turns out most of these allocations were in the
parquet reader itself. While the table being built will eventually be
constructed in Arrow's memory pool, the parquet reader does not use the memory
pool for the various allocations needed to operate the reader itself.
So, putting this all together into a hypothesis...
The chunk size of 3 means we have a ton of metadata. This metadata gets
allocated by the parquet reader in lots of very small allocations. These
allocations fragment badly, and the system allocator ends up scattering them
across a wide swath of RSS, resulting in a large amount of over-allocation.
h3. Fixes
h4. Fix 1: Use more jemalloc
Since my test was already using jemalloc, I can configure it the same way Arrow
does by enabling the background thread and setting it to purge on a 1-second
interval. Now, running my test, after 5 seconds of inactivity I get the
following from the global jemalloc:
{noformat}
Allocated: 246608, active: 544768, metadata: 15155760 (n_thp 0), resident: 15675392, mapped: 23613440, retained: 1382526976
{noformat}
That same ~15MB of data and jemalloc metadata is now spread across only 15.6MB
of RSS (much better fragmentation behavior). I can confirm this by looking at
the RSS of the process, which reports 25MB (most of which is explained by the
two jemalloc instances' metadata), a massive improvement over 1GB.
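For reference, here is roughly how I configured the stock jemalloc my test
links against. The mechanism (the malloc_conf symbol, or the MALLOC_CONF
environment variable) is documented jemalloc behavior, but the exact option
string is my reconstruction of what Arrow does rather than a copy of its build
settings:
{code:c++}
// Defining this symbol with C linkage configures a stock jemalloc at startup;
// setting MALLOC_CONF in the environment achieves the same thing.
extern "C" const char* malloc_conf =
    "background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:1000";
{code}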
h4. Fix 2: Use a sane chunk size
If I change the chunk size to 100,000, then parquet is no longer making so many
tiny allocations (my program also runs much faster) and I get the following
stats for the global jemalloc instance:
{noformat}
Allocated: 1756168, active: 2027520, metadata: 4492600 (n_thp 0), resident: 6496256, mapped: 8318976, retained: 64557056
{noformat}
And I see only 18.5MB of RSS usage.
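Concretely, in your reproduction that just means changing the last argument to
parquet::arrow::WriteTable (the chunk size) from 3 to something reasonable,
e.g.:
{code:c++}
PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
    table, arrow::default_memory_pool(), outfile, /*chunk_size=*/100000));
{code}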
> [C++] Memory leak while reading parquet file
> --------------------------------------------
>
> Key: ARROW-14987
> URL: https://issues.apache.org/jira/browse/ARROW-14987
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 6.0.1
> Reporter: Qingxiang Chen
> Priority: Major
>
> When I used parquet to access data, I found that the memory usage was still
> high after the function ended. I reproduced this problem in the example; the
> code is shown below:
>
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <unistd.h>
> #include <iostream>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 320000; i++) {
>     PARQUET_THROW_NOT_OK(i64builder.Append(i));
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema =
>       arrow::schema({arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
>       table, arrow::default_memory_pool(), outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                             arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the function ends, during the sleep, the memory usage is still more than
> 100MB and does not drop. When I increase the data volume by 5 times, the
> memory usage is about 500MB, and it still does not drop.
> I want to know whether this part of the data is cached by the memory pool, or
> whether it is a memory leak. If there is no memory leak, how can I set the
> memory pool size or release the memory?