HeartSaVioR opened a new pull request #28904:
URL: https://github.com/apache/spark/pull/28904


   ### What changes were proposed in this pull request?
   
   Many operations in CompactibleFileStreamLog read a metadata log file and 
materialize all of its entries in memory. Because of how compaction works, 
CompactibleFileStreamLog may end up with a huge compact log file containing a 
large number of entries, and for now that file only ever grows, which means the 
amount of memory needed to materialize it grows along with it. This puts 
pressure on GC.
   
   This patch proposes to streamline the logic in the file stream source and 
sink wherever possible, so that entries are not all held in memory at once. To 
make this possible we have to break the existing behavior of excluding entries: 
currently the `compactLogs` method is called with all entries, which forces us 
to materialize all of them in memory. This should have no effect on end users, 
because only the file stream sink has a condition for excluding entries, and 
that condition has never been true. (DELETE_ACTION has never been set.)
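
The difference between the two approaches can be sketched roughly as follows. This is a language-agnostic illustration in Python, not the Scala code in the patch: `read_entries`, `compact_materialized`, and `compact_streaming` are hypothetical names, and the one-JSON-entry-per-line layout is an assumption made for the example.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def read_entries(log: TextIO) -> Iterator[dict]:
    """Lazily yield one entry per non-empty line (assumed layout)."""
    for line in log:
        line = line.strip()
        if line:
            yield json.loads(line)


def compact_materialized(logs: Iterable[TextIO], out: TextIO) -> int:
    """Collect every entry in memory first, then write them all out."""
    # This list grows with the total size of the logs being compacted.
    entries = [entry for log in logs for entry in read_entries(log)]
    for entry in entries:
        out.write(json.dumps(entry) + "\n")
    return len(entries)


def compact_streaming(logs: Iterable[TextIO], out: TextIO) -> int:
    """Write each entry as soon as it is read; only one is held at a time."""
    count = 0
    for log in logs:
        for entry in read_entries(log):
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Both functions produce the same output file; the streaming variant only works because no entry has to be compared against the others before it is written.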
   
   Based on this observation, the patch also slightly changes an existing UT 
that simulates the situation where file "A" is added and a later batch marks 
file "A" as deleted. That situation simply doesn't work with this change, but 
as I mentioned earlier it has never been used. (I'm not sure the UT reflects an 
actual run. I guess not.)
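
To illustrate why the exclusion behavior conflicts with single-pass processing, here is a minimal sketch in Python (again, not the patch's Scala code). `Entry`, `compact_with_exclusion`, and `compact_pass_through` are hypothetical names, and the action string values are placeholders chosen for the example.

```python
from typing import Iterable, Iterator, List, NamedTuple

# Placeholder action values, assumed for illustration only.
ADD_ACTION = "add"
DELETE_ACTION = "delete"


class Entry(NamedTuple):
    path: str
    action: str


def compact_with_exclusion(entries: List[Entry]) -> List[Entry]:
    """Drop every entry whose path is marked deleted anywhere in the input.

    The delete marker may arrive after the matching add, so the full list
    must be available before anything can be emitted.
    """
    deleted = {e.path for e in entries if e.action == DELETE_ACTION}
    return [e for e in entries if e.path not in deleted]


def compact_pass_through(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Single pass with no exclusion: every entry is emitted as it is read."""
    yield from entries
```

Given the sequence add "A", add "B", delete "A", the exclusion variant keeps only "B", while the pass-through variant keeps all three entries.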
   
   ### Why are the changes needed?
   
   The memory issue (OOME) has been reported both in a JIRA issue and on the 
user mailing list.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   * Existing UTs
   * Manual test done
   
   The manual test uses a simple app which continuously writes the file stream 
sink metadata log.
   
   
https://github.com/HeartSaVioR/spark-delegation-token-experiment/commit/bea7680e4c588f455f8c3181a96c9eff5002fa1a
   
   The test is configured so that each batch metadata log file is 1.9M (10,000 
entries), whereas the other Spark configurations are left at their defaults 
(compact interval = 10). The app runs as the driver, and the driver heap memory 
is set to 3g.
   
   > before the patch
   
   <img width="1094" alt="Screen Shot 2020-06-23 at 3 37 44 PM" 
src="https://user-images.githubusercontent.com/1317309/85375841-d94f3480-b571-11ea-817b-c6b48b34888a.png">
   
   It ran for only 40 minutes, with the latest compact batch file at 1.3G. The 
process struggled with GC and eventually threw an OOME.
   
   > after the patch
   
   <img width="1094" alt="Screen Shot 2020-06-23 at 3 53 29 PM" 
src="https://user-images.githubusercontent.com/1317309/85375901-eff58b80-b571-11ea-837e-30d107f677f9.png">
   
   It sustained a 2-hour run, with the latest compact batch file at 2.2G. The 
actual memory usage didn't even reach 1.2G, and it was cleaned up quickly 
without noticeable GC activity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
