[GitHub] [beam] damccorm opened a new issue, #20688: A dataflow job created the second time from the same template is unable to read BigQuery.

GitBox Sat, 04 Jun 2022 12:02:36 -0700


damccorm opened a new issue, #20688:
URL: https://github.com/apache/beam/issues/20688


   ## Error
   
   The first job created from a custom template reads from BigQuery 
successfully. When the first job is done, the second job created from the same 
template fails with an error:
   
    
   ```
   
   2020-10-06 14:48:02.002 PDTError message from worker: java.io.IOException: 
Cannot start an export job
   since table 
retailsearchproject-lowes:temp_dataset_beam_bq_job_QUERY_bigquerypipelinealitvinov100621394839e5f68a_eba5287932a74fb88787362144bd56f2.temp_table_beam_bq_job_QUERY_bigquerypipelinealitvinov100621394839e5f68a_eba5287932a74fb88787362144bd56f2
   does not exist 
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:115)
   
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
 
org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:290)
   
org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:212)
   
org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:196)
   
org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:175)
   
org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
   
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
   
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
 
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
   
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
   
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
   
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
   java.util.concurrent.FutureTask.run(FutureTask.java:266) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
java.lang.Thread.run(Thread.java:748)
   
   ```
   
   ## Library version
   
   `org.apache.beam:beam-runners-google-cloud-dataflow-java:2.24.0`
   ## Reproducing
    * Clone the  repository with an example java project.
   
   ```
   
   https://github.com/alitvinov-gd/dataflow-bigquery-run-job-twice.git
   ```
   
    * Create a BigQuery table `repeating_job_input` and populate a row:
   
   ```
   
   insert into andrei_L.repeating_job_input (id, group_id, dummy) values('100', 
'10000', 'DUMMY1');
   
   ```
   
    * Create a template
   
   ```
   
   mvn clean compile exec:java 
-Dexec.mainClass=org.litvinov.andrei.beam.bquery.example.BigqueryPipeline
   -Dexec.args="--runner=DataflowRunner --region=us-central1 --project=<your 
project> --stagingLocation=gs://<your
   bucket>/staging/ --inputTable=andrei_L.repeating_job_input 
--outputFile=gs://<your bucket>/temp/repeating_job_output
   --tempLocation=gs://<your bucket>/temp/ --templateLocation=gs://<your 
bucket>/templates/example_template
   --workerMachineType=n1-standard-8" 
   ```
   
    * Create a job the first time
   
   ```
   
   gcloud dataflow jobs run first_job --project=<your project> 
--region=us-central1 --gcs-location=gs://<your
   bucket>/templates/example_template
   
   ```
   
    * After the job finishes successfully, check the output file
   
   ```
   
   gsutil cat gs://<your bucket/temp/repeating_job_output-00000-of-00001
   
   
   ```
   
   expecting to see:
   ```
   
   1000
   
   ```
   
    
    * Clean the output file
   
   ```
   
   gsutil rm gs://<your bucket/temp/repeating_job_output-00000-of-00001
   ```
   
    
    * Create the second job
   
   ```
   
   gcloud dataflow jobs run second_job --project=<your project> 
--region=us-central1 --gcs-location=gs://<your
   bucket>/templates/example_template
   
   ```
   
    * Expect the second job to fail
    * Expect no output file created
   
   ```
   
   gsutil ls gs://<your bucket/temp/repeating_job_output-00000-of-00001 
CommandException: One or more
   URLs matched no objects
   ```
   
   
   Imported from Jira 
[BEAM-11029](https://issues.apache.org/jira/browse/BEAM-11029). Original Jira 
may contain additional context.
   Reported by: alitvinov-gd.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] damccorm opened a new issue, #20688: A dataflow job created the second time from the same template is unable to read BigQuery.

Reply via email to