liujiayi771 opened a new issue, #5766:
URL: https://github.com/apache/incubator-gluten/issues/5766
### Backend
VL (Velox)
### Bug description
When reading large CSV files, the peak memory usage of the Arrow memory pool
is very high. For example, when a single CSV file in a table is 300 MB, the
peak during single-threaded reading can reach 500 MB. With a 2 GB CSV file,
the peak rises to 1.7 GB. There does not appear to be a memory leak, but the
peak usage is very high.
Judging from the Arrow Dataset code, we appear to be using the streaming CSV
reader, so in theory memory consumption should not grow in proportion to the
size of the CSV file.
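To make that expectation concrete, here is a minimal sketch of block-wise reading. It is an illustration only and not how Arrow's CSV reader is implemented; `StreamingCsvSketch`, `Stats`, and `countRows` are hypothetical names. The point is that when one fixed-size buffer is reused for every block, the bytes held at any moment are bounded by the block size, regardless of input size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not Arrow's actual streaming reader.
final class StreamingCsvSketch {
    /** Rows seen in one pass and the most bytes buffered at any moment. */
    static final class Stats {
        final long rows;
        final long peakBufferedBytes;

        Stats(long rows, long peakBufferedBytes) {
            this.rows = rows;
            this.peakBufferedBytes = peakBufferedBytes;
        }
    }

    /** Counts newline-terminated rows while reading in blocks of blockSize bytes. */
    static Stats countRows(InputStream in, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        byte[] block = new byte[blockSize]; // single buffer, reused for every block
        long rows = 0;
        long peak = 0;
        try {
            int n;
            while ((n = in.read(block)) != -1) {
                peak = Math.max(peak, n); // never exceeds blockSize
                for (int i = 0; i < n; i++) {
                    if (block[i] == '\n') {
                        rows++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Stats(rows, peak);
    }

    /** Convenience overload for in-memory test data. */
    static Stats countRows(String csv, int blockSize) {
        byte[] bytes = csv.getBytes(StandardCharsets.UTF_8);
        return countRows(new ByteArrayInputStream(bytes), blockSize);
    }
}
```

With this shape, a 100x larger input yields a 100x larger row count but the same peak buffer size, which is the behavior one would hope for from a streaming reader.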
I added some code to the release method of ArrowNativeMemoryPool to print the
peak memory.
```java
@Override
public void release() throws Exception {
  // Debug output: peak and current bytes tracked by the listener.
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}
```
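For readers unfamiliar with what a `peak` versus `current` reading means here, the following is a minimal sketch of a peak-tracking counter. It is not Gluten's actual listener; the class name `PeakTrackingListener` and the methods `reserve` and `unreserve` are hypothetical, chosen only to mirror the `peak()` and `current()` calls in the snippet above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; not Gluten's actual listener class.
final class PeakTrackingListener {
    private final AtomicLong current = new AtomicLong();
    private final AtomicLong peak = new AtomicLong();

    /** Records an allocation of the given size and raises the peak if needed. */
    void reserve(long bytes) {
        long now = current.addAndGet(bytes);
        peak.accumulateAndGet(now, Math::max);
    }

    /** Records that the given number of bytes were freed; the peak is never lowered. */
    void unreserve(long bytes) {
        current.addAndGet(-bytes);
    }

    long current() {
        return current.get();
    }

    long peak() {
        return peak.get();
    }
}
```

After `reserve(300)`, `reserve(200)`, and `unreserve(500)`, `current()` is back to 0 while `peak()` stays at 500. That matches the pattern described in this report: nothing is leaked at release time, yet the high-water mark is large.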
I also added some logging to the Arrow code to check the peak memory.
```c++
Result<RecordBatchGenerator> CsvFileFormat::ScanBatchesAsync(
    const std::shared_ptr<ScanOptions>& scan_options,
    const std::shared_ptr<FileFragment>& file) const {
  auto this_ = checked_pointer_cast<const CsvFileFormat>(shared_from_this());
  auto source = file->source();
  auto reader_fut =
      OpenReaderAsync(source, *this, scan_options,
                      ::arrow::internal::GetCpuThreadPool());
  auto generator = GeneratorFromReader(std::move(reader_fut),
                                       scan_options->batch_size);
  WRAP_ASYNC_GENERATOR_WITH_CHILD_SPAN(
      generator, "arrow::dataset::CsvFileFormat::ScanBatchesAsync::Next");
  // Debug output: current and maximum bytes allocated from the default pool.
  std::cout << "memory=" << default_memory_pool()->bytes_allocated()
            << ", max=" << default_memory_pool()->max_memory() << std::endl;
  return generator;
}
```
<img width="542" alt="image"
src="https://github.com/apache/incubator-gluten/assets/13622031/02cd9643-12b0-4d1c-a426-1cfdeac77d76">
<img width="963" alt="image"
src="https://github.com/apache/incubator-gluten/assets/13622031/89fd4020-28d7-49fd-9063-442f1f21d359">
### Spark version
None
### Spark configurations
_No response_
### System information
_No response_
### Relevant logs
_No response_