baohe-zhang commented on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-653110422
Hi @HeartSaVioR @tgravescs , I measured the memory and disk usage for
a 1.21 GB log file and for logs of the same application compressed with different
codecs. The logs were generated by Spark 3 and parsed by the Spark 3 SHS. The application
contains 400 jobs; each job contains one stage, and each stage contains 1000 tasks.
| codec | uncompressed | lz4 | lzf | snappy | zstd |
| ----- | ------------ | --- | --- | ------ | ---- |
| log filesize | 1.21 gb | 108 mb | 128 mb | 136 mb | 40 mb |
| actual memory usage (measured through Utils.SizeEstimator) | 254.8 mb | 252.1 mb | 260.5 mb | 256.4 mb | 279.2 mb |
| estimated memory usage (log size / 2 for uncompressed log, log size \* 2 for compressed log) | 605 mb | 216 mb | 256 mb | 272 mb | 80 mb |
| disk usage (leveldb filesize) | 393 mb | 398 mb | 403 mb | 395 mb | 424 mb |
From these results, it seems we are overestimating the memory usage of uncompressed
files and underestimating the memory usage of zstd-compressed files. I think
filesize / 4 for uncompressed logs and filesize \* 4 for zstd-compressed logs might
be a better estimation.