liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134574129
@jinchengchenghh I tested the latest code, and the peak memory usage is
still relatively high. Adding logs in `ArrowReservationListener.reserve`
produced no output in my case, so instead I added two methods to
`ArrowReservationListener` and printed peak and current in
`ArrowNativeMemoryPool.release`.
```java
public long peak() {
  return sharedUsage.peak();
}

public long current() {
  return sharedUsage.current();
}
```
```java
@Override
public void release() throws Exception {
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}
```
I created a Parquet table and used `spark-sql --local` to read data from
a CSV table and insert overwrite it into the Parquet table. My dataset is the
100GB TPC-DS. I first tested the store_sales table, where each CSV file is
700MB in size. The log output is as follows; the peak memory is about 920MB:
```
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=956301312, current=8388608
```
I then tested the catalog_sales table, where each CSV file is 1.15GB
in size. The log output is as follows; the peak memory is about 1064MB:
```
peak=1124073472, current=8388608
peak=1140850688, current=8388608
peak=1115684864, current=8388608
peak=1124073472, current=8388608
peak=1149239296, current=8388608
peak=1115684864, current=8388608
```
I also constructed a larger catalog_sales table with a single 30GB CSV file.
The log output is as follows; the peak memory is about 6GB:
```
peak=6601834496, current=8388608
```
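For reference, the raw byte counts above map to the sizes I quoted like this (a quick standalone sanity check of my own, not Gluten code):

```java
// Convert the logged peak values from bytes to MiB to match the
// figures quoted above. toMiB is a hypothetical helper for this check.
public class PeakCheck {
    static long toMiB(long bytes) {
        return bytes >> 20; // divide by 1024 * 1024
    }

    public static void main(String[] args) {
        System.out.println(toMiB(964689920L));   // store_sales peak: 920
        System.out.println(toMiB(1115684864L));  // catalog_sales peak: 1064
        System.out.println(toMiB(6601834496L));  // 30GB single-file peak: 6296 (~6GB)
    }
}
```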
The peak memory I logged should be consumed only by the CSV reader.
This issue is not that urgent for me at the moment: after splitting the
large CSV file into smaller files, everything works normally.
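For anyone hitting the same limit, the splitting step can be sketched with GNU coreutils `split`. This is a minimal example on a toy file; the filename and chunk size are placeholders to adjust for the real 30GB CSV, and a CSV header row would need separate handling:

```shell
# Toy stand-in for the large CSV; in practice this would be the 30GB file.
printf 'row1,a\nrow2,b\nrow3,c\nrow4,d\n' > sample.csv

# --line-bytes caps each output file's size while breaking only at line
# boundaries, so no CSV row is cut in half. For a real file, something
# like --line-bytes=700M keeps chunks near the size that behaved well above.
split --line-bytes=16 sample.csv sample_part_

ls sample_part_*
```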
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
For additional commands, e-mail: [email protected]