AnandInguva commented on code in PR #24347:
URL: https://github.com/apache/beam/pull/24347#discussion_r1035245877


##########
.test-infra/jenkins/job_InferenceBenchmarkTests_Python.groovy:
##########
@@ -134,27 +137,60 @@ def loadTestConfigurations = {
         influx_measurement    : 'torch_language_modeling_bert_large_uncased',
         influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
         influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        device                : 'CPU',
         input_file            : 
'gs://apache-beam-ml/testing/inputs/sentences_50k.txt',
         bert_tokenizer        : 'bert-large-uncased',
         model_state_dict_path : 
'gs://apache-beam-ml/models/huggingface.BertForMaskedLM.bert-large-uncased.pth',
         output                : 
'gs://temp-storage-for-end-to-end-tests/torch/result_bert_large_uncased' + now 
+ '.txt'
       ]
     ],
+    [
+      title             : 'Pytorch Imagenet Classification with Resnet 152 
with Tesla T4 GPU',
+      test              : 
'apache_beam.testing.benchmarks.inference.pytorch_image_classification_benchmarks',
+      runner            : CommonTestProperties.Runner.DATAFLOW,
+      pipelineOptions: [
+        job_name              : 'benchmark-tests-pytorch-imagenet-python-gpu' 
+ now,
+        project               : 'apache-beam-testing',
+        region                : 'us-central1',
+        machine_type          : 'n1-standard-2',
+        num_workers           : 75, // this could be lower as the quota for 
the apache-beam-testing project is 32 T4 GPUs as of November 28th, 2022.
+        disk_size_gb          : 50,
+        autoscaling_algorithm : 'NONE',
+        staging_location      : 'gs://temp-storage-for-perf-tests/loadtests',
+        temp_location         : 'gs://temp-storage-for-perf-tests/loadtests',
+        requirements_file     : 
'apache_beam/ml/inference/torch_tests_requirements.txt',
+        publish_to_big_query  : true,
+        metrics_dataset       : 'beam_run_inference',
+        metrics_table         : 
'torch_inference_imagenet_results_resnet152_tesla_t4',
+        input_options         : '{}', // this option is not required for 
RunInference tests.
+        influx_measurement    : 'torch_inference_imagenet_resnet152_tesla_t4',
+        influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
+        influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        pretrained_model_name : 'resnet152',
+        device                : 'GPU',
+        experiments           : 
'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver',
+        sdk_container_image   : 
'us.gcr.io/apache-beam-testing/python-postcommit-it/tensor_rt:latest',
+        input_file            : 
'gs://apache-beam-ml/testing/inputs/openimage_50k_benchmark.txt',
+        model_state_dict_path : 
'gs://apache-beam-ml/models/torchvision.models.resnet152.pth',
+        output                : 
'gs://temp-storage-for-end-to-end-tests/torch/result_resnet152_gpu' + now + 
'.txt'
+      ]
+    ],
   ]
 }
 
 def loadTestJob = { scope ->
   List<Map> testScenarios = loadTestConfigurations()
   for (Map testConfig: testScenarios){
     commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 180)
-    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, 
CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, 
null, testConfig.pipelineOptions.requirements_file)
+    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, 
CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, 
null,
+        testConfig.pipelineOptions.requirements_file, '3.8')

Review Comment:
   I added a param to the PythonTestProperties. PTAL (please take another look).



##########
.test-infra/jenkins/job_InferenceBenchmarkTests_Python.groovy:
##########
@@ -134,27 +137,60 @@ def loadTestConfigurations = {
         influx_measurement    : 'torch_language_modeling_bert_large_uncased',
         influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
         influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        device                : 'CPU',
         input_file            : 
'gs://apache-beam-ml/testing/inputs/sentences_50k.txt',
         bert_tokenizer        : 'bert-large-uncased',
         model_state_dict_path : 
'gs://apache-beam-ml/models/huggingface.BertForMaskedLM.bert-large-uncased.pth',
         output                : 
'gs://temp-storage-for-end-to-end-tests/torch/result_bert_large_uncased' + now 
+ '.txt'
       ]
     ],
+    [
+      title             : 'Pytorch Imagenet Classification with Resnet 152 
with Tesla T4 GPU',
+      test              : 
'apache_beam.testing.benchmarks.inference.pytorch_image_classification_benchmarks',
+      runner            : CommonTestProperties.Runner.DATAFLOW,
+      pipelineOptions: [
+        job_name              : 'benchmark-tests-pytorch-imagenet-python-gpu' 
+ now,
+        project               : 'apache-beam-testing',
+        region                : 'us-central1',
+        machine_type          : 'n1-standard-2',
+        num_workers           : 75, // this could be lower as the quota for 
the apache-beam-testing project is 32 T4 GPUs as of November 28th, 2022.
+        disk_size_gb          : 50,
+        autoscaling_algorithm : 'NONE',
+        staging_location      : 'gs://temp-storage-for-perf-tests/loadtests',
+        temp_location         : 'gs://temp-storage-for-perf-tests/loadtests',
+        requirements_file     : 
'apache_beam/ml/inference/torch_tests_requirements.txt',
+        publish_to_big_query  : true,
+        metrics_dataset       : 'beam_run_inference',
+        metrics_table         : 
'torch_inference_imagenet_results_resnet152_tesla_t4',
+        input_options         : '{}', // this option is not required for 
RunInference tests.
+        influx_measurement    : 'torch_inference_imagenet_resnet152_tesla_t4',
+        influx_db_name        : InfluxDBCredentialsHelper.InfluxDBDatabaseName,
+        influx_hostname       : InfluxDBCredentialsHelper.InfluxDBHostUrl,
+        pretrained_model_name : 'resnet152',
+        device                : 'GPU',
+        experiments           : 
'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver',
+        sdk_container_image   : 
'us.gcr.io/apache-beam-testing/python-postcommit-it/tensor_rt:latest',
+        input_file            : 
'gs://apache-beam-ml/testing/inputs/openimage_50k_benchmark.txt',
+        model_state_dict_path : 
'gs://apache-beam-ml/models/torchvision.models.resnet152.pth',
+        output                : 
'gs://temp-storage-for-end-to-end-tests/torch/result_resnet152_gpu' + now + 
'.txt'
+      ]
+    ],
   ]
 }
 
 def loadTestJob = { scope ->
   List<Map> testScenarios = loadTestConfigurations()
   for (Map testConfig: testScenarios){
     commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 180)
-    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, 
CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, 
null, testConfig.pipelineOptions.requirements_file)
+    loadTestsBuilder.loadTest(scope, testConfig.title, testConfig.runner, 
CommonTestProperties.SDK.PYTHON, testConfig.pipelineOptions, testConfig.test, 
null,
+        testConfig.pipelineOptions.requirements_file, '3.8')
   }
 }
 
 PhraseTriggeringPostCommitBuilder.postCommitJob(
     'beam_Inference_Python_Benchmarks_Dataflow',
     'Run Inference Benchmarks',
-    'Inference benchmarks on Dataflow(\"Run Inference Benchmarks"\"")',
+    'Beam Inference benchmarks on Dataflow(\"Run Inference Benchmarks"\"")',

Review Comment:
   Done.



##########
sdks/python/apache_beam/testing/benchmarks/inference/README.md:
##########
@@ -62,12 +80,24 @@ the following metrics:
 - Mean Load Model Latency - the average amount of time it takes to load a 
model. This is done once per DoFn instance on worker
 startup, so the cost is amortized across the pipeline.
 
+These metrics are published to InfluxDB and BigQuery.
+
+<h3>Pytorch Language Modeling Tests</h3>

Review Comment:
   Done



##########
sdks/python/apache_beam/testing/benchmarks/inference/README.md:
##########
@@ -38,16 +38,34 @@ the following metrics:
 - Mean Load Model Latency - the average amount of time it takes to load a 
model. This is done once per DoFn instance on worker
 startup, so the cost is amortized across the pipeline.
 
+These metrics are published to InfluxDB and BigQuery.
+
+<h3>Pytorch Image Classification Tests</h3>
+
+* Pytorch Image Classification with Resnet 101.

Review Comment:
   We already explain these in the examples that we use to run these 
benchmarks. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to