ryanthompson591 commented on code in PR #24347:
URL: https://github.com/apache/beam/pull/24347#discussion_r1035225775


##########
sdks/python/apache_beam/testing/benchmarks/inference/README.md:
##########
@@ -62,12 +80,24 @@ the following metrics:
 - Mean Load Model Latency - the average amount of time it takes to load a model. This is done once per DoFn instance on worker startup, so the cost is amortized across the pipeline.
 
+These metrics are published to InfluxDB and BigQuery.
+
+<h3>Pytorch Language Modeling Tests</h3>

Review Comment:
   nit: use ### instead of h3 for consistency.



##########
sdks/python/apache_beam/testing/benchmarks/inference/README.md:
##########
@@ -38,16 +38,34 @@ the following metrics:
 - Mean Load Model Latency - the average amount of time it takes to load a model. This is done once per DoFn instance on worker startup, so the cost is amortized across the pipeline.
 
+These metrics are published to InfluxDB and BigQuery.
+
+<h3>Pytorch Image Classification Tests</h3>
+
+* Pytorch Image Classification with Resnet 101.

Review Comment:
   (optional) I would just describe what the tests do and why here. Users can look up the details in the test itself (these details will become outdated if someone changes the test parameters but not the README).



##########
.test-infra/jenkins/job_InferenceBenchmarkTests_Python.groovy:
##########
@@ -134,27 +137,60 @@ def loadTestConfigurations = {
        influx_measurement    : 'torch_language_modeling_bert_large_uncased',
        influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
        influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        device                : 'CPU',
        input_file            : 'gs://apache-beam-ml/testing/inputs/sentences_50k.txt',
        bert_tokenizer        : 'bert-large-uncased',
        model_state_dict_path : 'gs://apache-beam-ml/models/huggingface.BertForMaskedLM.bert-large-uncased.pth',
        output                : 'gs://temp-storage-for-end-to-end-tests/torch/result_bert_large_uncased' + now + '.txt'
      ]
    ],
+    [
+      title             : 'Pytorch Imagenet Classification with Resnet 152 with Tesla T4 GPU',
+      test              : 'apache_beam.testing.benchmarks.inference.pytorch_image_classification_benchmarks',
+      runner            : CommonTestProperties.Runner.DATAFLOW,
+      pipelineOptions: [
+        job_name              : 'benchmark-tests-pytorch-imagenet-python-gpu' + now,
+        project               : 'apache-beam-testing',
+        region                : 'us-central1',
+        machine_type          : 'n1-standard-2',
+        num_workers           : 75, // this could be lower as the quota for the apache-beam-testing project is 32 T4 GPUs as of November 28th, 2022.
+        disk_size_gb          : 50,
+        autoscaling_algorithm : 'NONE',
+        staging_location      : 'gs://temp-storage-for-perf-tests/loadtests',
+        temp_location         : 'gs://temp-storage-for-perf-tests/loadtests',
+        requirements_file     : 'apache_beam/ml/inference/torch_tests_requirements.txt',
+        publish_to_big_query  : true,
+        metrics_dataset       : 'beam_run_inference',
+        metrics_table         : 'torch_inference_imagenet_results_resnet152_tesla_t4',
+        input_options         : '{}', // this option is not required for RunInference tests.
+        influx_measurement    : 'torch_inference_imagenet_resnet152_tesla_t4',
+        influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
+        influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        pretrained_model_name : 'resnet152',
+        device                : 'GPU',
+        experiments           : 'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver',
+        sdk_container_image   : 'us.gcr.io/apache-beam-testing/python-postcommit-it/tensor_rt:latest',
+        input_file            : 'gs://apache-beam-ml/testing/inputs/openimage_50k_benchmark.txt',
+        model_state_dict_path : 'gs://apache-beam-ml/models/torchvision.models.resnet152.pth',
+        output                : 'gs://temp-storage-for-end-to-end-tests/torch/result_resnet152_gpu' + now + '.txt'
+      ]
+    ],
   ]
 }
 
 def loadTestJob = { scope ->
   List<Map> testScenarios = loadTestConfigurations()
   for (Map testConfig: testScenarios){
     commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 180)
-    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, null, testConfig.pipelineOptions.requirements_file)
+    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, null,
+        testConfig.pipelineOptions.requirements_file, '3.8')

Review Comment:
   I'm worried that hard-coding 3.8 here might become outdated.
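
   One way to address this, as a sketch only: declare the version once at the top of the job file (the constant name `PYTHON_VERSION` below is hypothetical, not an existing helper in the Jenkins DSL), so a future bump only touches one line.

   ```groovy
   // Hypothetical sketch: a single shared constant for the SDK Python version.
   final String PYTHON_VERSION = '3.8'

   // The loadTest call then references the constant instead of a literal:
   loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner,
       CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, null,
       testConfig.pipelineOptions.requirements_file, PYTHON_VERSION)
   ```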



##########
.test-infra/jenkins/job_InferenceBenchmarkTests_Python.groovy:
##########
@@ -134,27 +137,60 @@ def loadTestConfigurations = {
 PhraseTriggeringPostCommitBuilder.postCommitJob(
     'beam_Inference_Python_Benchmarks_Dataflow',
     'Run Inference Benchmarks',
-    'Inference benchmarks on Dataflow(\"Run Inference Benchmarks\")',
+    'Beam Inference benchmarks on Dataflow(\"Run Inference Benchmarks\")',

Review Comment:
   Seems redundant to say this is Beam inference; maybe use RunInference?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
