liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134574129
@jinchengchenghh I tested the latest code, and the peak memory usage is
still relatively high. Adding logs in `ArrowReservationListener.reserve`
produced no output in my case, so instead I added two methods to
`ArrowReservationListener` and printed peak and current in
`ArrowNativeMemoryPool.release`.
```java
public long peak() {
  return sharedUsage.peak();
}

public long current() {
  return sharedUsage.current();
}
```
```java
@Override
public void release() throws Exception {
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}
```
I created a Parquet table and used `spark-sql --local` to read data from
a CSV table and insert overwrite it into the Parquet table. My dataset is the
100GB TPC-DS. I first tested the store_sales table, where each CSV file is
700MB in size. The log output is as follows; the peak memory is about 920MB:
```
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=956301312, current=8388608
```
I then tested the catalog_sales table, where each CSV file is 1.15GB
in size. The log output is as follows; the peak memory is about 1064MB:
```
peak=1124073472, current=8388608
peak=1140850688, current=8388608
peak=1115684864, current=8388608
peak=1124073472, current=8388608
peak=1149239296, current=8388608
peak=1115684864, current=8388608
```
I also constructed a larger catalog_sales table with a single 30GB CSV file.
The log output is as follows; the peak memory is about 6GB:
```
peak=6601834496, current=8388608
```
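For reference, the raw byte counts above map to the sizes I quoted like this (a quick standalone sanity check of my own, not Gluten code):

```java
// Convert the logged peak values from bytes to MiB to match the
// figures quoted above. toMiB is a hypothetical helper for this check.
public class PeakCheck {
    static long toMiB(long bytes) {
        return bytes >> 20; // divide by 1024 * 1024
    }

    public static void main(String[] args) {
        System.out.println(toMiB(964689920L));   // store_sales peak: 920
        System.out.println(toMiB(1115684864L));  // catalog_sales peak: 1064
        System.out.println(toMiB(6601834496L));  // 30GB single-file peak: 6296 (~6GB)
    }
}
```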
The peak memory I logged should be consumed only by the CSV reader.
This issue is not that urgent for me at the moment: after splitting the
large CSV file into smaller files, everything works normally.
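For anyone hitting the same limit, the splitting step can be sketched with GNU coreutils `split`. This is a minimal example on a toy file; the filename and chunk size are placeholders to adjust for the real 30GB CSV, and a CSV header row would need separate handling:

```shell
# Toy stand-in for the large CSV; in practice this would be the 30GB file.
printf 'row1,a\nrow2,b\nrow3,c\nrow4,d\n' > sample.csv

# --line-bytes caps each output file's size while breaking only at line
# boundaries, so no CSV row is cut in half. For a real file, something
# like --line-bytes=700M keeps chunks near the size that behaved well above.
split --line-bytes=16 sample.csv sample_part_

ls sample_part_*
```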
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
For additional commands, e-mail: [email protected]