baohe-zhang commented on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-655002716
I measured the memory usage of some smaller apps; the results are:

* 200 jobs, 400 tasks for each job: 265 MB file size, 57.9 MB memory usage.
* 100 jobs, 400 tasks for each job: 133 MB file size, 28.5 MB memory usage.
* 50 jobs, 400 tasks for each job: 67 MB file size, 14.9 MB memory usage.
* 20 jobs, 400 tasks for each job: 35 MB file size, 8.3 MB memory usage.
* 10 jobs, 400 tasks for each job: 15 MB file size, 4.7 MB memory usage.
* 1 job, 400 tasks for each job: 3.7 MB file size, 2.2 MB memory usage.
* 1 job, 40 tasks for each job: 512 KB file size, 727 KB memory usage.

I found that the ratio of memory usage to file size is stable at about 1/4 for log files larger than 30 MB. For log files smaller than 15 MB, the ratio is greater than 1/4, and it increases as the file size decreases. The difference in serving latency between the in-memory and LevelDB stores depends on the machine's performance, but personally I feel that parsing a log file larger than 50 MB with the hybrid store can give a notable improvement.
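As a rough illustration of the scaling these numbers suggest (this is my own sketch, not code from the PR; the object name, the 1/4 ratio, and the floor value are assumptions drawn only from the measurements above):

```scala
// Hypothetical helper: estimate the in-memory store footprint from the event
// log size, using the measurements above. For logs larger than ~30 MB the
// observed memory/file ratio is roughly 1/4; for very small logs the footprint
// bottoms out around a fixed overhead, so we apply a floor. This deliberately
// underestimates mid-size logs (1-15 MB), where the observed ratio is above 1/4.
object HybridStoreMemoryEstimate {
  private val LargeLogRatio = 0.25        // observed ratio for logs > 30 MB
  private val FloorBytes = 700L * 1024    // assumed baseline overhead (~700 KB)

  /** Returns a rough estimate of the in-memory footprint, in bytes. */
  def estimate(logSizeBytes: Long): Long =
    math.max((logSizeBytes * LargeLogRatio).toLong, FloorBytes)

  def main(args: Array[String]): Unit = {
    // Compare the estimate against a few of the measured file sizes.
    Seq(265L << 20, 67L << 20, 512L * 1024).foreach { size =>
      println(f"log ${size / 1048576.0}%.1f MB -> est. ${estimate(size) / 1048576.0}%.1f MB in memory")
    }
  }
}
```

With these assumed constants the estimate tracks the large-log measurements (e.g. 67 MB file -> ~16.8 MB estimated vs. 14.9 MB measured) and the floor matches the smallest case, which is consistent with the point that the hybrid store pays off mainly for logs above roughly 50 MB.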
