steveloughran commented on PR #5519:
URL: https://github.com/apache/hadoop/pull/5519#issuecomment-1515223731
Parallel test running failed everywhere, but I have improved
ITestAbfsLoadManifestsStage performance:
* back to the original 200 manifest files
* increased the worker pool and buffer queue size (this was more significant
before reducing the manifest count)
This brings the test time down to 10s locally. The IOStats do imply that many
MB of data are being PUT/GET, so it is good to keep the test small so that
people running with less bandwidth don't suffer. Maybe the size could be
switched with a -Dscale?
On the IOStats: there seem to be a lot of delete requests, but that is because
when we write the manifest it is done as a write to a temp file and then a
rename, and the dest is deleted first, without any check.
In production that cost is absorbed in task commit, and at ~60ms for a DELETE
vs ~40ms for a HEAD, we should decide what to do here. I think for renames in
job commit we could do the HEAD before the DELETE, simply because that is the
bottleneck, so maybe do it here too...
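To make the two options concrete, here is a minimal sketch of "delete the dest unconditionally" versus "HEAD first, DELETE only if the dest exists". It is not the committer or ABFS code: `CountingStore`, `saveUnconditionalDelete`, `saveWithHeadFirst` and the path names are all hypothetical, and the store is an in-memory set that only counts requests, so it says nothing about real latencies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an in-memory stand-in for a store, used only to count requests.
class HeadBeforeDeleteSketch {

  static final class CountingStore {
    private final Set<String> paths = new HashSet<>();
    int headRequests;
    int deleteRequests;
    int failedDeletes;

    void create(String path) {
      paths.add(path);
    }

    // Probe for existence; counted as one HEAD request.
    boolean head(String path) {
      headRequests++;
      return paths.contains(path);
    }

    // Counted as one DELETE request; a missing path counts as a failure.
    boolean delete(String path) {
      deleteRequests++;
      boolean removed = paths.remove(path);
      if (!removed) {
        failedDeletes++;
      }
      return removed;
    }

    void rename(String src, String dest) {
      if (!paths.remove(src)) {
        throw new IllegalStateException("missing source: " + src);
      }
      paths.add(dest);
    }
  }

  // Write to temp, delete the dest without any check, then rename.
  static void saveUnconditionalDelete(CountingStore store, String temp, String dest) {
    store.create(temp);
    store.delete(dest);
    store.rename(temp, dest);
  }

  // Write to temp, HEAD the dest, delete only if it exists, then rename.
  static void saveWithHeadFirst(CountingStore store, String temp, String dest) {
    store.create(temp);
    if (store.head(dest)) {
      store.delete(dest);
    }
    store.rename(temp, dest);
  }

  // Save `manifests` files to fresh destinations using one of the two strategies.
  static CountingStore run(boolean headFirst, int manifests) {
    CountingStore store = new CountingStore();
    for (int i = 0; i < manifests; i++) {
      String temp = "tmp/manifest-" + i;
      String dest = "manifests/manifest-" + i;
      if (headFirst) {
        saveWithHeadFirst(store, temp, dest);
      } else {
        saveUnconditionalDelete(store, temp, dest);
      }
    }
    return store;
  }

  public static void main(String[] args) {
    for (boolean headFirst : new boolean[] {false, true}) {
      CountingStore s = run(headFirst, 200);
      System.out.println("headFirst=" + headFirst
          + " HEAD=" + s.headRequests
          + " DELETE=" + s.deleteRequests
          + " failedDELETE=" + s.failedDeletes);
    }
  }
}
```

When no destination exists, the HEAD-first variant swaps every failing DELETE for a failing HEAD, so whether it is a win depends on the relative cost of those two failures in the real store.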
```
2023-04-19 19:43:05,489 INFO [JUnit]:
manifest.AbstractManifestCommitterTest
(AbstractManifestCommitterTest.java:dumpFileSystemIOStatistics(450)) -
Aggregate FileSystem Statistics counters=((action_http_delete_request=402)
(action_http_delete_request.failures=200)
(action_http_get_request=202)
(action_http_head_request=404)
(action_http_head_request.failures=202)
(action_http_put_request=1103)
(bytes_received=10160814)
(bytes_sent=10160814)
(committer_task_directory_count=20000)
(committer_task_file_count=20000)
(committer_task_manifest_file_size=10160814)
(connections_made=2111)
(directories_created=303)
(files_created=200)
(get_responses=2111)
(job_stage_create_target_dirs=1)
(job_stage_load_manifests=1)
(job_stage_setup=1)
(op_create=200)
(op_create_directories=1)
(op_delete=803)
(op_get_file_status=407)
(op_get_file_status.failures=202)
(op_list_status=2)
(op_load_all_manifests=1)
(op_load_manifest=200)
(op_mkdirs=605)
(op_msync=1)
(op_open=200)
(op_rename=400)
(rename_path_attempts=200)
(send_requests=1103)
(task_stage_save_manifest=200)
(task_stage_save_task_manifest=200)
(task_stage_setup=200));
gauges=();
minimums=((action_http_delete_request.failures.min=25)
(action_http_delete_request.min=36)
(action_http_get_request.min=40)
(action_http_head_request.failures.min=22)
(action_http_head_request.min=20)
(action_http_put_request.min=24)
(committer_task_directory_count=100)
(committer_task_file_count=100)
(committer_task_manifest_file_size=49990)
(job_stage_create_target_dirs.min=259)
(job_stage_load_manifests.min=2804)
(job_stage_setup.min=183)
(op_create_directories.min=256)
(op_delete.min=25)
(op_get_file_status.failures.min=22)
(op_get_file_status.min=22)
(op_list_status.min=87)
(op_load_all_manifests.min=2627)
(op_load_manifest.min=49)
(op_mkdirs.min=24)
(op_msync.min=0)
(op_rename.min=70)
(task_stage_save_manifest.min=273)
(task_stage_save_task_manifest.min=144)
(task_stage_setup.min=51));
maximums=((action_http_delete_request.failures.max=413)
(action_http_delete_request.max=291)
(action_http_get_request.max=2031)
(action_http_head_request.failures.max=439)
(action_http_head_request.max=430)
(action_http_put_request.max=2662)
(committer_task_directory_count=100)
(committer_task_file_count=100)
(committer_task_manifest_file_size=50876)
(job_stage_create_target_dirs.max=259)
(job_stage_load_manifests.max=2804)
(job_stage_setup.max=183)
(op_create_directories.max=256)
(op_delete.max=413)
(op_get_file_status.failures.max=439)
(op_get_file_status.max=22)
(op_list_status.max=127)
(op_load_all_manifests.max=2627)
(op_load_manifest.max=2031)
(op_mkdirs.max=245)
(op_msync.max=0)
(op_rename.max=932)
(task_stage_save_manifest.max=2863)
(task_stage_save_task_manifest.max=2757)
(task_stage_setup.max=471));
means=((action_http_delete_request.failures.mean=(samples=200, sum=9850,
mean=49.2500))
(action_http_delete_request.mean=(samples=202, sum=12448, mean=61.6238))
(action_http_get_request.mean=(samples=202, sum=78955, mean=390.8663))
(action_http_head_request.failures.mean=(samples=202, sum=12782,
mean=63.2772))
(action_http_head_request.mean=(samples=202, sum=8096, mean=40.0792))
(action_http_put_request.mean=(samples=1103, sum=108966, mean=98.7906))
(committer_task_directory_count=(samples=200, sum=20000, mean=100.0000))
(committer_task_file_count=(samples=200, sum=20000, mean=100.0000))
(committer_task_manifest_file_size=(samples=200, sum=10160814,
mean=50804.0700))
(job_stage_create_target_dirs.mean=(samples=1, sum=259, mean=259.0000))
(job_stage_load_manifests.mean=(samples=1, sum=2804, mean=2804.0000))
(job_stage_setup.mean=(samples=1, sum=183, mean=183.0000))
(op_create_directories.mean=(samples=1, sum=256, mean=256.0000))
(op_delete.mean=(samples=401, sum=22278, mean=55.5561))
(op_get_file_status.failures.mean=(samples=202, sum=12806, mean=63.3960))
(op_get_file_status.mean=(samples=1, sum=22, mean=22.0000))
(op_list_status.mean=(samples=2, sum=214, mean=107.0000))
(op_load_all_manifests.mean=(samples=1, sum=2627, mean=2627.0000))
(op_load_manifest.mean=(samples=200, sum=79570, mean=397.8500))
(op_mkdirs.mean=(samples=302, sum=15536, mean=51.4437))
(op_msync.mean=(samples=1, sum=0, mean=0.0000))
(op_rename.mean=(samples=200, sum=22999, mean=114.9950))
(task_stage_save_manifest.mean=(samples=200, sum=115797, mean=578.9850))
(task_stage_save_task_manifest.mean=(samples=200, sum=82893, mean=414.4650))
(task_stage_setup.mean=(samples=200, sum=21878, mean=109.3900)));
```
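For anyone wanting to pull numbers out of a dump like the one above, a rough sketch of a counter parser follows. It is a plain regex over the logged text, not the Hadoop IOStatistics API; the class and method names are hypothetical, and `failureRatio` assumes that a `.failures` counter is a subset of its base counter, which should be checked before relying on the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts "(name=value)" integer entries from logged statistics text.
class IOStatsCounterSketch {

  // Matches entries such as (action_http_delete_request.failures=200).
  private static final Pattern ENTRY =
      Pattern.compile("\\(([A-Za-z0-9_.]+)=(\\d+)\\)");

  static Map<String, Long> parseCounters(String text) {
    Map<String, Long> counters = new LinkedHashMap<>();
    Matcher m = ENTRY.matcher(text);
    while (m.find()) {
      counters.put(m.group(1), Long.parseLong(m.group(2)));
    }
    return counters;
  }

  // Assumption: the base counter includes the calls counted under ".failures".
  static double failureRatio(Map<String, Long> counters, String name) {
    long total = counters.getOrDefault(name, 0L);
    long failures = counters.getOrDefault(name + ".failures", 0L);
    return total == 0 ? 0.0 : (double) failures / total;
  }

  public static void main(String[] args) {
    // Values copied from the counters section of the log above.
    String sample = "counters=((action_http_delete_request=402) "
        + "(action_http_delete_request.failures=200) "
        + "(action_http_head_request=404) "
        + "(action_http_head_request.failures=202))";
    Map<String, Long> counters = parseCounters(sample);
    System.out.println("delete failure ratio: "
        + failureRatio(counters, "action_http_delete_request"));
    System.out.println("head failure ratio: "
        + failureRatio(counters, "action_http_head_request"));
  }
}
```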