[ https://issues.apache.org/jira/browse/BEAM-11029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542167#comment-17542167 ]

Christian commented on BEAM-11029:
----------------------------------

Hi, 

I'm facing the same issue. Were you able to fix this?
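
In case it helps: if the pipeline behind the template reads with BigQueryIO.read()/readTableRows(), the template-compatible source ({{withTemplateCompatibility()}}) defers the BigQuery export/temp-table setup from template-construction time to job execution, which is what re-running a template requires. A minimal sketch under that assumption; the class name and hard-coded table below are illustrative, not the linked repo's actual code:

{code:java}
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TemplateCompatibleRead {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // withTemplateCompatibility() makes the BigQuery source re-initialize its
    // temp-table/export state on every job launched from the template instead
    // of reusing state captured when the template was built.
    PCollection<TableRow> rows = pipeline.apply("ReadFromBigQuery",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .from("andrei_L.repeating_job_input"));  // table from the repro steps below

    pipeline.run();
  }
}
{code}

Beam's javadoc describes this as the template-compatible source implementation, which matches the launch-the-same-template-twice scenario reported here.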

> A dataflow job created the second time from the same template is unable to 
> read BigQuery.
> -----------------------------------------------------------------------------------------
>
>                 Key: BEAM-11029
>                 URL: https://issues.apache.org/jira/browse/BEAM-11029
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.24.0
>         Environment: GCP Dataflow
>            Reporter: Andrei
>            Priority: P3
>              Labels: bigquery, dataflow
>
> h2. Error
> The first job created from a custom template reads from BigQuery 
> successfully. When the first job is done, the second job created from the 
> same template fails with an error:
>  
> {noformat}
> 2020-10-06 14:48:02.002 PDT Error message from worker: java.io.IOException: Cannot start an export job since table retailsearchproject-lowes:temp_dataset_beam_bq_job_QUERY_bigquerypipelinealitvinov100621394839e5f68a_eba5287932a74fb88787362144bd56f2.temp_table_beam_bq_job_QUERY_bigquerypipelinealitvinov100621394839e5f68a_eba5287932a74fb88787362144bd56f2 does not exist
> org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:115)
> org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:290)
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:212)
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:196)
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:175)
> org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
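>
> A plausible mechanism (an assumption, not verified against the SDK source): the temp dataset/table name above embeds a job UUID, and with classic templates, values computed while the pipeline graph is constructed are baked into the template, as in this illustrative fragment:
> {code:java}
> // Illustration only, not Beam's actual code: a value computed at
> // graph-construction time is serialized into the template, so every job
> // launched from the template sees the same "random" UUID.
> String jobUuid = java.util.UUID.randomUUID().toString();
> {code}
> If BigQueryIO derives its temp table name from such a construction-time UUID, every run of the template points at the same temp table; once the first run's cleanup deletes it, later runs fail with "does not exist".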
> h2. Library version
> {{org.apache.beam:beam-runners-google-cloud-dataflow-java:2.24.0}}
> h2. Reproducing
>  * Clone the repository with the example Java project:
> {code:java}
> git clone https://github.com/alitvinov-gd/dataflow-bigquery-run-job-twice.git{code}
>  * Create a BigQuery table {{repeating_job_input}} and populate it with one row:
> {code:java}
> insert into andrei_L.repeating_job_input (id, group_id, dummy) values('100', '10000', 'DUMMY1');
> {code}
>  * Create a template:
> {code:java}
> mvn clean compile exec:java \
>   -Dexec.mainClass=org.litvinov.andrei.beam.bquery.example.BigqueryPipeline \
>   -Dexec.args="--runner=DataflowRunner --region=us-central1 --project=<your project> \
>     --stagingLocation=gs://<your bucket>/staging/ --inputTable=andrei_L.repeating_job_input \
>     --outputFile=gs://<your bucket>/temp/repeating_job_output --tempLocation=gs://<your bucket>/temp/ \
>     --templateLocation=gs://<your bucket>/templates/example_template --workerMachineType=n1-standard-8"
> {code}
>  * Create the first job from the template:
> {code:java}
> gcloud dataflow jobs run first_job --project=<your project> --region=us-central1 \
>   --gcs-location=gs://<your bucket>/templates/example_template
> {code}
>  * After the job finishes successfully, check the output file:
> {code:java}
> gsutil cat gs://<your bucket>/temp/repeating_job_output-00000-of-00001
> {code}
> expecting to see:
> {code:java}
> 1000
> {code}
>  
>  * Remove the output file:
> {code:java}
> gsutil rm gs://<your bucket>/temp/repeating_job_output-00000-of-00001{code}
>  
>  * Create the second job:
> {code:java}
> gcloud dataflow jobs run second_job --project=<your project> --region=us-central1 \
>   --gcs-location=gs://<your bucket>/templates/example_template
> {code}
>  * Expect the second job to fail
>  * Expect no output file to be created:
> {code:java}
> gsutil ls gs://<your bucket>/temp/repeating_job_output-00000-of-00001
> CommandException: One or more URLs matched no objects{code}


