[jira] [Work logged] (BEAM-7484) Throughput collection in BigQuery performance tests

ASF GitHub Bot (JIRA) Fri, 12 Jul 2019 05:37:09 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-7484?focusedWorklogId=275869&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-275869
 ]


ASF GitHub Bot logged work on BEAM-7484:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Jul/19 12:36
            Start Date: 12/Jul/19 12:36
    Worklog Time Spent: 10m 
      Work Description: kamilwu commented on pull request #8766: [BEAM-7484] 
Metrics collection in BigQuery perf tests
URL: https://github.com/apache/beam/pull/8766#discussion_r302961841
 
 

 ##########
 File path: sdks/python/apache_beam/io/gcp/bigquery_read_perf_test.py
 ##########
 @@ -126,9 +133,21 @@ def format_record(record):
     p.run().wait_until_finish()
 
   def test(self):
+    def extract_values(row):
+      """Extracts value from a row."""
+      yield base64.b64decode(row.values()[0])
+
     self.result = (self.pipeline
                    | 'Read from BigQuery' >> Read(BigQuerySource(
                        dataset=self.input_dataset, table=self.input_table))
+                   | 'Measure bytes' >> ParDo(MeasureBytes(
+                       self.metrics_namespace, extract_values))
+                   | 'Count messages' >> ParDo(CountMessages(
+                       self.metrics_namespace))
+                   | 'Measure time: Start' >> ParDo(MeasureTime(
+                       self.metrics_namespace))
+                   | 'Measure time: End' >> ParDo(MeasureTime(
 
 Review comment:
   > I don't understand what the difference is between Start and End versions 
of MeasureTime. 
   
   Indeed, there is no difference between Start and End versions in this case. 
I'll leave only one monitor then.
   
   > What time values do you extract from these metrics?
   
   In general, the MeasureTime step records the processing time of the first 
element (let's say: `x`) and the last element (let's say: `y`) in the 
PCollection. The difference between the maximum of `y` and the minimum of `x` 
is approximately the total running time. I believe that's the best way to 
measure IO performance for now, since we don't have any metrics gathered by 
the IO itself. 
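   
   As a rough illustration of the idea (a minimal sketch, not the actual 
`MeasureTime` implementation from this PR): the class `TimeMonitor`, its 
injectable `clock` parameter, and the `throughput` helper are hypothetical 
names used only for this example. The monitor keeps the earliest and latest 
processing times it observes and reports their difference as the approximate 
runtime, which can then be turned into a bytes/time or messages/time figure.

```python
import time


class TimeMonitor:
    """Hypothetical stand-in for a MeasureTime-like step (not Beam's API)."""

    def __init__(self, clock=time.time):
        # The clock is injectable so the logic can be checked deterministically.
        self._clock = clock
        self.start = None  # earliest processing time seen (minimum of x)
        self.end = None    # latest processing time seen (maximum of y)

    def process(self, element):
        """Records the current time and passes the element through unchanged."""
        now = self._clock()
        self.start = now if self.start is None else min(self.start, now)
        self.end = now if self.end is None else max(self.end, now)
        return element

    def runtime(self):
        """Approximate total running time: latest time minus earliest time."""
        if self.start is None:
            return 0.0
        return self.end - self.start


def throughput(total, runtime_seconds):
    """Returns units per second, or None when the runtime is not positive."""
    if runtime_seconds <= 0:
        return None
    return total / runtime_seconds
```

   In a distributed run, elements are processed on several workers, so taking 
the minimum start and the maximum end across all of them is what makes the 
difference an approximation of the total running time.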
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 275869)
    Time Spent: 3h 50m  (was: 3h 40m)

> Throughput collection in BigQuery performance tests
> ---------------------------------------------------
>
>                 Key: BEAM-7484
>                 URL: https://issues.apache.org/jira/browse/BEAM-7484
>             Project: Beam
>          Issue Type: New Feature
>          Components: testing
>            Reporter: Kamil Wasilewski
>            Assignee: Kamil Wasilewski
>            Priority: Major
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The goal is to collect bytes/time and messages/time metrics in BQ read and 
> write tests in Python SDK.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
