[
https://issues.apache.org/jira/browse/BEAM-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717845#comment-16717845
]
Neil McCrossin commented on BEAM-6206:
--------------------------------------
Probably the same issue as BEAM-5434.
> Dataflow template which reads from BigQuery fails if used more than once
> ------------------------------------------------------------------------
>
> Key: BEAM-6206
> URL: https://issues.apache.org/jira/browse/BEAM-6206
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp, runner-dataflow
> Affects Versions: 2.8.0
> Reporter: Neil McCrossin
> Assignee: Tyler Akidau
> Priority: Major
>
> When a pipeline contains a BigQuery read, and that pipeline is uploaded as a
> template and the template is run in Cloud Dataflow, it will run successfully
> the first time, but every subsequent run fails because it can't find a file in
> the folder BigQueryExtractTemp (see error message below). If the template is
> uploaded again it will work +once only+ and then fail on every run after the
> first.
> *Error message:*
> java.io.FileNotFoundException: No files matched spec:
> gs://bigquery-bug-report-4539/temp/BigQueryExtractTemp/847a342637a64e73b126ad33f764dcc9/000000000000.avro
> *Steps to reproduce:*
> 1. Create the Beam Word Count sample as described
> [here|https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven].
> 2. Copy the command line from the section "Run WordCount on the Cloud
> Dataflow service" and substitute in your own project id and bucket name. Make
> sure you can run it successfully.
> 3. In the file WordCount.java, add the following lines below the existing
> import statements:
> {code:java}
> import org.apache.beam.sdk.coders.AvroCoder;
> import org.apache.beam.sdk.coders.DefaultCoder;
> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
> import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
> import org.apache.beam.sdk.transforms.SerializableFunction;
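> // Empty output type for the BigQuery parse function below; @DefaultCoder
> // tells Beam to encode TestOutput instances with AvroCoder.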
> @DefaultCoder(AvroCoder.class)
> class TestOutput
> {
> }
> {code}
>
> 4. In this same file, replace the entire method runWordCount with the
> following code:
> {code:java}
> static void runWordCount(WordCountOptions options) {
>   Pipeline p = Pipeline.create(options);
>   // Read from BigQuery; at execution time this exports the table to Avro files
>   // under <tempLocation>/BigQueryExtractTemp/ and then reads them back.
>   p.apply("ReadBigQuery", BigQueryIO
>       .read(new SerializableFunction<SchemaAndRecord, TestOutput>() {
>         @Override
>         public TestOutput apply(SchemaAndRecord record) {
>           return new TestOutput();
>         }
>       })
>       .from("bigquery-public-data:stackoverflow.tags")
>   );
>   p.run();
> }
> {code}
> (Note I am using the stackoverflow.tags table for purposes of demonstration
> because it is public and not too large, but the problem seems to occur for
> any table).
> 5. Add the following pipeline parameters to the command line that you have
> been using:
> {code:java}
> --tempLocation=gs://<STORAGE_BUCKET>/temp/
> --templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
> {code}
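> With those parameters added, the full template-creation command should look
> roughly like the one below (adapted from the quickstart command in step 2;
> <PROJECT_ID> and <STORAGE_BUCKET> are placeholders for your own values):
> {code:java}
> mvn -Pdataflow-runner compile exec:java \
>   -Dexec.mainClass=org.apache.beam.examples.WordCount \
>   -Dexec.args="--project=<PROJECT_ID> \
>   --stagingLocation=gs://<STORAGE_BUCKET>/staging/ \
>   --output=gs://<STORAGE_BUCKET>/output \
>   --runner=DataflowRunner \
>   --tempLocation=gs://<STORAGE_BUCKET>/temp/ \
>   --templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template"
> {code}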
> 6. Run the command line so that the template is created.
> 7. Launch the template through the Cloud Console by clicking on "CREATE JOB
> FROM TEMPLATE". Give it the job name "test-1", choose "Custom Template" at
> the bottom of the list and browse to the template
> "my-bigquery-dataflow-template", then press "Run job".
> 8. The job should succeed. But then repeat step 7 and it will fail.
> 9. Repeat steps 6 and 7 and it will work again. Repeat step 7 and it will
> fail again.
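> In case it helps with scripting the repro: the template launch in step 7 can
> presumably also be done with the gcloud SDK instead of the Cloud Console, along
> these lines:
> {code:java}
> gcloud dataflow jobs run test-1 \
>   --gcs-location=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
> {code}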
>
> This bug may be related to BEAM-2058 (just a hunch).