[
https://issues.apache.org/jira/browse/BEAM-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neil McCrossin updated BEAM-6206:
---------------------------------
Description:
When a pipeline contains a BigQuery read and is uploaded as a Cloud Dataflow
template, the template runs successfully the first time it is launched, but
every launch after that fails because a file in the BigQueryExtractTemp folder
cannot be found (see error message below). If the template is uploaded again it
works +once only+, then fails on every subsequent launch.
*Error message:*
java.io.FileNotFoundException: No files matched spec:
gs://bigquery-bug-report-4539/temp/BigQueryExtractTemp/847a342637a64e73b126ad33f764dcc9/000000000000.avro
*Steps to reproduce:*
1. Create the Beam Word Count sample as described
[here|https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven].
2. Copy the command line from the section "Run WordCount on the Cloud Dataflow
service" and substitute in your own project id and bucket name. Make sure you
can run it successfully.
3. In the file WordCount.java, add the following lines below the existing
import statements:
{code:java}
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.transforms.SerializableFunction;

// Minimal element type for the BigQuery read; AvroCoder can encode it
// even though it has no fields.
@DefaultCoder(AvroCoder.class)
class TestOutput {
}
{code}
4. In this same file, replace the entire method runWordCount with the
following code:
{code:java}
static void runWordCount(WordCountOptions options) {
  Pipeline p = Pipeline.create(options);

  p.apply("ReadBigQuery", BigQueryIO
      .read(new SerializableFunction<SchemaAndRecord, TestOutput>() {
        @Override
        public TestOutput apply(SchemaAndRecord record) {
          // The record contents don't matter; any read is enough to
          // trigger the BigQuery extract step that exhibits the bug.
          return new TestOutput();
        }
      })
      .from("bigquery-public-data:stackoverflow.tags"));

  p.run();
}
{code}
(Note: I am using the stackoverflow.tags table for demonstration because it is
public and not too large, but the problem appears to occur with any table.)
5. Add the following pipeline parameters to the command line that you have been
using:
{code}
--tempLocation=gs://<STORAGE_BUCKET>/temp/
--templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
{code}
6. Run the command line so that the template is created.
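For reference, the full command might look something like the following (a
sketch only: <PROJECT_ID> and <STORAGE_BUCKET> are placeholders, and the exact
flags should match whatever command you copied in step 2):
{code}
mvn compile exec:java \
  -Pdataflow-runner \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--project=<PROJECT_ID> \
    --stagingLocation=gs://<STORAGE_BUCKET>/staging/ \
    --output=gs://<STORAGE_BUCKET>/output \
    --runner=DataflowRunner \
    --tempLocation=gs://<STORAGE_BUCKET>/temp/ \
    --templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template"
{code}
When --templateLocation is set, this stages the pipeline as a template at that
path instead of running a job.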
7. Launch the template through the Cloud Console by clicking on "CREATE JOB
FROM TEMPLATE". Give it the job name "test-1", choose "Custom Template" at the
bottom of the list and browse to the template "my-bigquery-dataflow-template",
then press "Run job".
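(Equivalently, if you prefer to script the launch, the gcloud CLI can run a
template from a GCS path; a sketch, assuming the bucket and template names from
step 5:)
{code}
gcloud dataflow jobs run test-1 \
  --gcs-location gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
{code}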
8. The job should succeed. Now repeat step 7: the job will fail.
9. Repeat steps 6 and 7 and the job will succeed again; repeat step 7 once more
and it will fail again.
This bug may be related to BEAM-2058 (just a hunch).
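If it is related, one plausible mechanism is that the random token naming the
BigQueryExtractTemp files is chosen while the pipeline graph is constructed,
i.e. at template creation time, so it gets baked into the template and every
launch looks for the same extract path. A minimal sketch of that failure mode
(hypothetical and illustrative only, not Beam's actual internals):
{code:java}
import java.util.UUID;

public class StaleExtractPathSketch {
  public static void main(String[] args) {
    // Hypothetical: anything computed while the pipeline graph is built is
    // serialized into the template, including this "random" token.
    String extractDir = "gs://<STORAGE_BUCKET>/temp/BigQueryExtractTemp/"
        + UUID.randomUUID().toString().replace("-", "") + "/";

    // First launch: the extract job writes 000000000000.avro under
    // extractDir, the read succeeds, and the temp files are later cleaned
    // up. Every subsequent launch of the same template reuses extractDir
    // verbatim, finds nothing there, and fails with
    // "java.io.FileNotFoundException: No files matched spec: ..."
    System.out.println(extractDir + "000000000000.avro");
  }
}
{code}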
> Dataflow template which reads from BigQuery fails if used more than once
> ------------------------------------------------------------------------
>
> Key: BEAM-6206
> URL: https://issues.apache.org/jira/browse/BEAM-6206
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Affects Versions: 2.8.0
> Reporter: Neil McCrossin
> Assignee: Tyler Akidau
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)