[ 
https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=281136&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-281136
 ]

ASF GitHub Bot logged work on BEAM-7495:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Jul/19 16:43
            Start Date: 23/Jul/19 16:43
    Worklog Time Spent: 10m 
      Work Description: aryann commented on pull request #9079: [BEAM-7495] Add progress reporting to the BigQuery source
URL: https://github.com/apache/beam/pull/9079#discussion_r306422346
 
 

 ##########
 File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageStreamSource.java
 ##########
 @@ -201,15 +206,20 @@ public synchronized boolean advance() throws IOException {
     private synchronized boolean readNextRecord() throws IOException {
       while (decoder == null || decoder.isEnd()) {
         if (!responseIterator.hasNext()) {
+          fractionConsumed = 1d;
           return false;
         }
 
+        // N.B.: For simplicity, we update fractionConsumed once a new response is fetched, not
+        // when we reach the end of the current response. In practice, this choice is not
+        // consequential.
+        fractionConsumed = fractionConsumedFromLastResponse;
 
 Review comment:
   I was under the impression that getCurrent() should not modify any state.
   
   The JavaDoc doesn't explicitly call out where the progress should be 
updated, but it mentions, "Should return 1 after [...] Source.Reader.advance() 
call that returns false."[1] Reading this and the JavaDoc for advance() and 
getCurrent() leads me to believe that advance() is the appropriate place to 
update progress.
   
   Please note that in the code I have written, progress is only updated after 
you have called advance() enough times to go through all rows in the response 
that returned the new progress. You can see an example of this in the unit 
tests.
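
To make that behavior concrete, here is a minimal, self-contained sketch. It is not the code in this PR and it does not use the Beam API: Response and SketchReader are hypothetical names, and only the field names fractionConsumed and fractionConsumedFromLastResponse are borrowed from the diff above. It assumes each response carries a batch of rows plus the progress value reported alongside it.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for one server response: a batch of rows plus the progress it reports. */
final class Response {
  final List<String> rows;
  final double fractionConsumed;

  Response(List<String> rows, double fractionConsumed) {
    this.rows = rows;
    this.fractionConsumed = fractionConsumed;
  }
}

/** Hypothetical reader (not Beam's BoundedReader) that updates progress only in advance(). */
final class SketchReader {
  private final Iterator<Response> responses;
  private Iterator<String> rows; // rows of the response currently being consumed
  private String current;
  private double fractionConsumed = 0d;
  private double fractionConsumedFromLastResponse = 0d;

  SketchReader(List<Response> responses) {
    this.responses = responses.iterator();
  }

  /** Moves to the next row; all state changes, including progress, happen here. */
  boolean advance() {
    while (rows == null || !rows.hasNext()) {
      if (!responses.hasNext()) {
        // No more data: progress must read 1 once advance() has returned false.
        fractionConsumed = 1d;
        return false;
      }
      // Every row of the previous response has been handed out, so publish
      // the progress that response reported before fetching the next one.
      fractionConsumed = fractionConsumedFromLastResponse;
      Response next = responses.next();
      rows = next.rows.iterator();
      fractionConsumedFromLastResponse = next.fractionConsumed;
    }
    current = rows.next();
    return true;
  }

  /** Pure accessor: returns the current row without modifying any state. */
  String getCurrent() {
    return current;
  }

  /** Pure accessor: returns the last progress value published by advance(). */
  double getFractionConsumed() {
    return fractionConsumed;
  }
}
```

With two responses carrying rows [a, b] at progress 0.5 and [c] at progress 0.9, getFractionConsumed() reads 0.0 while a and b are current, 0.5 once c is current, and 1.0 after advance() returns false.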
   
   I also found two examples where the progress is updated in advance():
    
   - 
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java#L838
   
   - 
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseIO.java#L449
 (uses range tracker that updates the progress; note that getCurrent() does not 
modify any state)
   
   [1] 
https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/BoundedSource.BoundedReader.html#getFractionConsumed--
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 281136)
            Time Spent: 4h 50m  (was: 4h 40m)
    Remaining Estimate: 499h 10m  (was: 499h 20m)

> Add support for dynamic worker re-balancing when reading BigQuery data using 
> Cloud Dataflow
> -------------------------------------------------------------------------------------------
>
>                 Key: BEAM-7495
>                 URL: https://issues.apache.org/jira/browse/BEAM-7495
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-gcp
>            Reporter: Aryan Naraghi
>            Assignee: Aryan Naraghi
>            Priority: Major
>   Original Estimate: 504h
>          Time Spent: 4h 50m
>  Remaining Estimate: 499h 10m
>
> Currently, the BigQuery connector for reading data using the BigQuery Storage 
> API does not support any of the facilities on the source for Dataflow to 
> split streams.
>  
> On the server side, the BigQuery Storage API supports splitting streams at a 
> fraction. By adding support to the connector, we enable Dataflow to split 
> streams, which unlocks dynamic worker re-balancing.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
