[
https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454944#comment-17454944
]
Qingxiang Chen commented on ARROW-14987:
----------------------------------------
Thank you for your answer; increasing the chunk size really helped me a lot. I
have two more questions.
1. What are the advantages of jemalloc over the default malloc, and why does
it reduce memory usage so much? Should I replace my system malloc with
jemalloc to get that optimization?
2. How do I use jemalloc to inspect the allocated memory? If I want to
analyze similar memory problems in my own system, what should I do?
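For reference, this is the small amount of pool introspection I know of in
Arrow's public C++ API (a minimal sketch; the note about
ARROW_DEFAULT_MEMORY_POOL reflects my understanding of how the backend is
selected, so please correct me if that is wrong):

{code:c++}
#include <arrow/api.h>
#include <iostream>

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  // Which allocator backs the default pool: "jemalloc", "mimalloc", or
  // "system", depending on the build (and, as I understand it, on the
  // ARROW_DEFAULT_MEMORY_POOL environment variable).
  std::cout << "backend: " << pool->backend_name() << std::endl;
  // Bytes currently allocated through this pool.
  std::cout << "allocated: " << pool->bytes_allocated() << std::endl;
  // High-water mark of pool allocations over the process lifetime.
  std::cout << "peak: " << pool->max_memory() << std::endl;
  return 0;
}
{code}

My understanding is that bytes_allocated() only tracks memory requested
through the pool, so pages that jemalloc has freed but kept cached would not
show up there.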
> [C++] Memory leak while reading Parquet file
> --------------------------------------------
>
> Key: ARROW-14987
> URL: https://issues.apache.org/jira/browse/ARROW-14987
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 6.0.1
> Reporter: Qingxiang Chen
> Priority: Major
>
> When I used Parquet to access data, I found that memory usage stayed high
> after the reading function returned. I reproduced the problem with the
> example code shown below:
>
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <unistd.h>
> #include <iostream>
>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 320000; i++) {
>     // Append returns a Status; it must not be ignored.
>     PARQUET_THROW_NOT_OK(i64builder.Append(i));
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema =
>       arrow::schema({arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
>
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   // chunk_size = 3 rows per row group, so 320000 rows produce a very
>   // large number of row groups.
>   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
>       table, arrow::default_memory_pool(), outfile, 3));
> }
>
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                             arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
>
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start" << std::endl;
>   read_whole_file();
>   std::cout << "end" << std::endl;
>   // Keep the process alive so resident memory can be observed.
>   sleep(100);
> }
> {code}
> After the read finishes, while the process sleeps, memory usage stays above
> 100 MB and does not drop. When I increase the data volume five-fold, memory
> usage is about 500 MB and likewise does not drop.
> I want to know whether this data is being cached by the memory pool, or
> whether it is a memory leak. If it is not a leak, how do I set the memory
> pool size or release the memory?
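> A minimal sketch of how I would try to distinguish the two cases from
> Arrow's public MemoryPool API (the jemalloc decay call is my assumption
> about a reasonable tuning for a jemalloc-enabled build, not a documented
> recipe):
> {code:c++}
> #include <arrow/api.h>
> #include <iostream>
>
> // Call after the table has gone out of scope.
> void report_pool_state() {
>   arrow::MemoryPool* pool = arrow::default_memory_pool();
>   // A true leak would keep this number high; if it is near zero while
>   // the process RSS stays large, the memory is likely cached by the
>   // allocator rather than leaked.
>   std::cout << "still allocated: " << pool->bytes_allocated() << std::endl;
>   if (pool->backend_name() == "jemalloc") {
>     // Ask jemalloc to return dirty pages to the OS immediately.
>     arrow::Status st = arrow::jemalloc_set_decay_ms(0);
>     if (!st.ok()) std::cerr << st.ToString() << std::endl;
>   }
> }
> {code}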