[
https://issues.apache.org/jira/browse/BEAM-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717845#comment-16717845
]
Neil McCrossin commented on BEAM-6206:
--------------------------------------
Probably the same issue as BEAM-5434.
> Dataflow template which reads from BigQuery fails if used more than once
> ------------------------------------------------------------------------
>
> Key: BEAM-6206
> URL: https://issues.apache.org/jira/browse/BEAM-6206
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp, runner-dataflow
> Affects Versions: 2.8.0
> Reporter: Neil McCrossin
> Assignee: Tyler Akidau
> Priority: Major
>
> When a pipeline contains a BigQuery read, and that pipeline is uploaded as a
> template and the template is run in Cloud Dataflow, it will run successfully
> the first time, but every subsequent run fails because it can't find a file in
> the folder BigQueryExtractTemp (see error message below). If the template is
> uploaded again it will work +once only+ and then fail on every run after the
> first.
> *Error message:*
> java.io.FileNotFoundException: No files matched spec:
> gs://bigquery-bug-report-4539/temp/BigQueryExtractTemp/847a342637a64e73b126ad33f764dcc9/000000000000.avro
> *Steps to reproduce:*
> 1. Create the Beam Word Count sample as described
> [here|https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven].
> 2. Copy the command line from the section "Run WordCount on the Cloud
> Dataflow service" and substitute in your own project id and bucket name. Make
> sure you can run it successfully.
> 3. In the file WordCount.java, add the following lines below the existing
> import statements:
> {code:java}
> import org.apache.beam.sdk.coders.AvroCoder;
> import org.apache.beam.sdk.coders.DefaultCoder;
> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
> import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
> import org.apache.beam.sdk.transforms.SerializableFunction;
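> // Empty output type for the BigQuery parse function below; @DefaultCoder
> // tells Beam to encode TestOutput instances with AvroCoder.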
> @DefaultCoder(AvroCoder.class)
> class TestOutput
> {
> }
> {code}
>
> 4. In this same file, replace the entire method runWordCount with the
> following code:
> {code:java}
> static void runWordCount(WordCountOptions options) {
>   Pipeline p = Pipeline.create(options);
>   // Read from BigQuery; at execution time this exports the table to Avro files
>   // under <tempLocation>/BigQueryExtractTemp/ and then reads them back.
>   p.apply("ReadBigQuery", BigQueryIO
>       .read(new SerializableFunction<SchemaAndRecord, TestOutput>() {
>         @Override
>         public TestOutput apply(SchemaAndRecord record) {
>           return new TestOutput();
>         }
>       })
>       .from("bigquery-public-data:stackoverflow.tags")
>   );
>   p.run();
> }
> {code}
> (Note I am using the stackoverflow.tags table for purposes of demonstration
> because it is public and not too large, but the problem seems to occur for
> any table).
> 5. Add the following pipeline parameters to the command line that you have
> been using:
> {code:java}
> --tempLocation=gs://<STORAGE_BUCKET>/temp/
> --templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
> {code}
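> With those parameters added, the full template-creation command should look
> roughly like the one below (adapted from the quickstart command in step 2;
> <PROJECT_ID> and <STORAGE_BUCKET> are placeholders for your own values):
> {code:java}
> mvn -Pdataflow-runner compile exec:java \
>   -Dexec.mainClass=org.apache.beam.examples.WordCount \
>   -Dexec.args="--project=<PROJECT_ID> \
>   --stagingLocation=gs://<STORAGE_BUCKET>/staging/ \
>   --output=gs://<STORAGE_BUCKET>/output \
>   --runner=DataflowRunner \
>   --tempLocation=gs://<STORAGE_BUCKET>/temp/ \
>   --templateLocation=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template"
> {code}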
> 6. Run the command line so that the template is created.
> 7. Launch the template through the Cloud Console by clicking on "CREATE JOB
> FROM TEMPLATE". Give it the job name "test-1", choose "Custom Template" at
> the bottom of the list and browse to the template
> "my-bigquery-dataflow-template", then press "Run job".
> 8. The job should succeed. But then repeat step 7 and it will fail.
> 9. Repeat steps 6 and 7 and it will work again. Repeat step 7 and it will
> fail again.
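> In case it helps with scripting the repro: the template launch in step 7 can
> presumably also be done with the gcloud SDK instead of the Cloud Console, along
> these lines:
> {code:java}
> gcloud dataflow jobs run test-1 \
>   --gcs-location=gs://<STORAGE_BUCKET>/my-bigquery-dataflow-template
> {code}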
>
> This bug may be related to BEAM-2058 (just a hunch).