[ https://issues.apache.org/jira/browse/MAPREDUCE-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714187#comment-17714187 ]
ASF GitHub Bot commented on MAPREDUCE-7435:
-------------------------------------------

steveloughran commented on PR #5519:
URL: https://github.com/apache/hadoop/pull/5519#issuecomment-1514982508

The updated PR has been run through Azure; the IOStatistics of a terasort run are:

```
2023-04-19 16:15:23,305 INFO [JUnit-test_140_teracomplete]: statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics:
counters=((commit_file_rename=4) (committer_bytes_committed=200021) (committer_commit_job=3) (committer_files_committed=4) (committer_task_directory_depth=7) (committer_task_file_count=8) (committer_task_file_size=200021) (committer_task_manifest_file_size=127578) (job_stage_create_target_dirs=3) (job_stage_load_manifests=3) (job_stage_rename_files=3) (job_stage_setup=3) (op_create_directories=3) (op_delete=3) (op_get_file_status=13) (op_get_file_status.failures=13) (op_list_status=10) (op_load_all_manifests=3) (op_load_manifest=7) (op_mkdirs=13) (op_msync=3) (task_stage_commit=7) (task_stage_scan_directory=7) (task_stage_setup=7));
gauges=((stage.job_stage_create_target_dirs.free.memory=1542076752) (stage.job_stage_create_target_dirs.heap.memory=533055152) (stage.job_stage_create_target_dirs.total.memory=2075131904) (stage.job_stage_load_manifests.free.memory=1544775808) (stage.job_stage_load_manifests.heap.memory=530356096) (stage.job_stage_load_manifests.total.memory=2075131904) (stage.job_stage_rename_files.free.memory=1505757416) (stage.job_stage_rename_files.heap.memory=569374488) (stage.job_stage_rename_files.total.memory=2075131904) (stage.setup.free.memory=1688139168) (stage.setup.heap.memory=386992736) (stage.setup.total.memory=2075131904));
minimums=((commit_file_rename.min=53) (committer_task_directory_count=0) (committer_task_directory_depth=1) (committer_task_file_count=0) (committer_task_file_size=0) (committer_task_manifest_file_size=18018) (job_stage_create_target_dirs.min=3) (job_stage_load_manifests.min=184) (job_stage_rename_files.min=73) (job_stage_setup.min=267) (op_create_directories.min=0) (op_delete.min=30) (op_get_file_status.failures.min=24) (op_list_status.min=170) (op_load_all_manifests.min=73) (op_load_manifest.min=54) (op_mkdirs.min=26) (op_msync.min=0) (task_stage_commit.min=176) (task_stage_scan_directory.min=176) (task_stage_setup.min=52));
maximums=((commit_file_rename.max=62) (committer_task_directory_count=0) (committer_task_directory_depth=1) (committer_task_file_count=1) (committer_task_file_size=100000) (committer_task_manifest_file_size=18389) (job_stage_create_target_dirs.max=4) (job_stage_load_manifests.max=250) (job_stage_rename_files.max=74) (job_stage_setup.max=295) (op_create_directories.max=1) (op_delete.max=42) (op_get_file_status.failures.max=113) (op_list_status.max=189) (op_load_all_manifests.max=139) (op_load_manifest.max=125) (op_mkdirs.max=74) (op_msync.max=0) (task_stage_commit.max=194) (task_stage_scan_directory.max=194) (task_stage_setup.max=93));
means=((commit_file_rename.mean=(samples=4, sum=227, mean=56.7500)) (committer_task_directory_count=(samples=14, sum=0, mean=0.0000)) (committer_task_directory_depth=(samples=7, sum=7, mean=1.0000)) (committer_task_file_count=(samples=14, sum=8, mean=0.5714)) (committer_task_file_size=(samples=7, sum=200021, mean=28574.4286)) (committer_task_manifest_file_size=(samples=7, sum=127578, mean=18225.4286)) (job_stage_create_target_dirs.mean=(samples=3, sum=11, mean=3.6667)) (job_stage_load_manifests.mean=(samples=3, sum=638, mean=212.6667)) (job_stage_rename_files.mean=(samples=3, sum=220, mean=73.3333)) (job_stage_setup.mean=(samples=3, sum=845, mean=281.6667)) (op_create_directories.mean=(samples=3, sum=2, mean=0.6667)) (op_delete.mean=(samples=3, sum=107, mean=35.6667)) (op_get_file_status.failures.mean=(samples=13, sum=638, mean=49.0769)) (op_list_status.mean=(samples=10, sum=1431, mean=143.1000)) (op_load_all_manifests.mean=(samples=3, sum=287, mean=95.6667)) (op_load_manifest.mean=(samples=7, sum=546, mean=78.0000)) (op_mkdirs.mean=(samples=13, sum=580, mean=44.6154)) (op_msync.mean=(samples=3, sum=0, mean=0.0000)) (task_stage_commit.mean=(samples=7, sum=1279, mean=182.7143)) (task_stage_scan_directory.mean=(samples=7, sum=1279, mean=182.7143)) (task_stage_setup.mean=(samples=7, sum=505, mean=72.1429)));
```

Once ABFS adds IOStatistics context updates for input stream reads, we could collect and add those into the stats too; not worrying about it until then.
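For anyone wanting to capture similar numbers from their own runs, here is a minimal sketch of pulling IOStatistics off a stream and a filesystem and logging them. `logIOStatisticsAtLevel` is the method visible in the log line above; `retrieveIOStatistics`, `ioStatisticsToPrettyString`, and the exact signatures are recalled from hadoop-common 3.3.x and should be checked against the tree, and the class name and path handling are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsLogging;
import org.apache.hadoop.fs.statistics.IOStatisticsSupport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative only: read a file, then dump the stream and filesystem IOStatistics. */
public class LogStreamStats {
  private static final Logger LOG = LoggerFactory.getLogger(LogStreamStats.class);

  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);                        // e.g. an abfs:// path
    FileSystem fs = FileSystem.get(path.toUri(), new Configuration());

    try (FSDataInputStream in = fs.open(path)) {
      in.read(new byte[4096]);
      // Streams that publish statistics expose them through IOStatisticsSource.
      IOStatistics streamStats = IOStatisticsSupport.retrieveIOStatistics(in);
      if (streamStats != null) {
        LOG.info("stream statistics:\n{}",
            IOStatisticsLogging.ioStatisticsToPrettyString(streamStats));
      }
    }

    // The filesystem instance aggregates statistics across its streams and operations;
    // this is the same call that produced the log line quoted above.
    IOStatisticsLogging.logIOStatisticsAtLevel(LOG, "info", fs);
  }
}
```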
> {code}
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.readOneBlock(AbfsInputStream.java:314)
>     at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.read(AbfsInputStream.java:267)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:539)
>     at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:133)
>     at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:256)
>     at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1656)
>     at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1085)
>     at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3585)
>     at org.apache.hadoop.util.JsonSerialization.fromJsonStream(JsonSerialization.java:164)
>     at org.apache.hadoop.util.JsonSerialization.load(JsonSerialization.java:279)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.TaskManifest.load(TaskManifest.java:361)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.ManifestStoreOperationsThroughFileSystem.loadTaskManifest(ManifestStoreOperationsThroughFileSystem.java:133)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.AbstractJobOrTaskStage.lambda$loadManifest$6(AbstractJobOrTaskStage.java:493)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.AbstractJobOrTaskStage$$Lambda$231/1813048085.apply(Unknown Source)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding$$Lambda$217/489150849.apply(Unknown Source)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:445)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.AbstractJobOrTaskStage.loadManifest(AbstractJobOrTaskStage.java:492)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.LoadManifestsStage.fetchTaskManifest(LoadManifestsStage.java:170)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.LoadManifestsStage.processOneManifest(LoadManifestsStage.java:138)
>     at org.apache.hadoop.mapreduce.lib.output.committer.manifest.stages.LoadManifestsStage$$Lambda$229/137752948.run(Unknown Source)
>     at org.apache.hadoop.util.functional.TaskPool$Builder.lambda$runParallel$0(TaskPool.java:410)
>     at org.apache.hadoop.util.functional.TaskPool$Builder$$Lambda$230/467893357.run(Unknown Source)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
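The issue description proposes a test that loads many, many manifests to see what breaks. As a crude starting point, here is a heap-pressure probe along those lines, assuming `TaskManifest.load(FileSystem, Path)` has the two-argument signature suggested by the stack trace (the call at TaskManifest.java:361); the class name, argument handling, and memory accounting are illustrative, and this is not the test added by the PR.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.TaskManifest;

/**
 * Rough heap-pressure probe: load one saved task manifest many times, keep
 * every copy alive, and see how much heap the retained manifests consume.
 * Helps separate "the manifests are big" from "the stream reading them is the problem".
 */
public class ManifestLoadPressure {

  public static void main(String[] args) throws Exception {
    Path manifestPath = new Path(args[0]);     // a manifest file saved by a real job run
    int copies = args.length > 1 ? Integer.parseInt(args[1]) : 1000;

    FileSystem fs = FileSystem.get(manifestPath.toUri(), new Configuration());
    List<TaskManifest> retained = new ArrayList<>(copies);
    Runtime rt = Runtime.getRuntime();

    System.gc();                               // best-effort; the numbers are approximate
    long before = rt.totalMemory() - rt.freeMemory();

    for (int i = 0; i < copies; i++) {
      // Signature assumed from the stack trace; verify against TaskManifest in the tree.
      retained.add(TaskManifest.load(fs, manifestPath));
    }

    System.gc();
    long after = rt.totalMemory() - rt.freeMemory();
    System.out.printf("retained %d manifests using ~%d bytes (~%.1f KB each)%n",
        retained.size(), after - before,
        (after - before) / 1024.0 / retained.size());
  }
}
```

Running it against a manifest from a job with many files and directories, and varying the copy count, should show whether retained manifest objects alone are enough to exhaust a typical driver heap, or whether the ABFS read path in the stack trace above is the larger contributor.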