[ https://issues.apache.org/jira/browse/BEAM-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Kirpichov closed BEAM-3067.
----------------------------------
Resolution: Fixed
Assignee: Reuven Lax (was: Thomas Groh)
Fix Version/s: 2.2.0
> BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)
> -------------------------------------------------------------------------
>
> Key: BEAM-3067
> URL: https://issues.apache.org/jira/browse/BEAM-3067
> Project: Beam
> Issue Type: Bug
> Components: runner-direct, sdk-java-gcp
> Affects Versions: 2.1.0
> Environment: Arch Linux, Java 1.8.0_144
> Reporter: Dmitry Bigunyak
> Assignee: Reuven Lax
> Priority: Major
> Fix For: 2.2.0
>
>
> I'm using the side output feature to filter malformed events (errors) out of
> a stream of events. I then save the valid events into one BigQuery table,
> and the errors go into another, dedicated table.
> Here is the code for outputting error rows:
> {code:java}
> invalidEventRows.apply("WriteErrors", BigQueryIO.writeTableRows()
>     .to(errorTableRef)
>     .withSchema(ProcessEvents.getErrorSchema())
>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
> {code}
> The problem is that when running on DirectRunner in batch mode (reading
> input from a file), if the {{invalidEventRows}} PCollection ends up empty
> (all events are valid, so there are no errors), I get the following error:
> {code}
> [ERROR] "status" : {
> [ERROR] "errorResult" : {
> [ERROR] "message" : "No schema specified on job or table.",
> [ERROR] "reason" : "invalid"
> [ERROR] },
> [ERROR] "errors" : [ {
> [ERROR] "message" : "No schema specified on job or table.",
> [ERROR] "reason" : "invalid"
> [ERROR] } ],
> [ERROR] "state" : "DONE"
> [ERROR] },
> {code}
> There are no errors when executing the same code with a non-empty
> {{invalidEventRows}} PCollection: the BigQuery table is created and the data
> is correctly inserted.
> Everything also seems to work fine in streaming mode (reading from
> Pub/Sub) on both DirectRunner and DataflowRunner.
> This looks like a bug. Or should I open an issue in the
> GoogleCloudPlatform/DataflowJavaSDK GitHub project instead?
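
The routing described in the report (valid events to one output, errors to another) can be sketched without Beam as below. This is only an illustrative model, not the reporter's pipeline: the class name `SideOutputSketch`, the `route` method, and the `key=value` validity rule are all hypothetical. In Beam itself the split would typically be done with a ParDo that has an additional output tag. The point of the sketch is that the error output is legitimately empty whenever every input event is valid, which is the case that triggers the failure reported above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Beam-free model of routing events to a main output and an error output. */
public class SideOutputSketch {

    /** Holds the two outputs produced by a single pass over the input. */
    public static final class Split {
        public final List<String> valid = new ArrayList<>();
        public final List<String> errors = new ArrayList<>();
    }

    /** Hypothetical validity rule (an assumption): a non-empty key, '=', and a non-empty value. */
    static boolean isValid(String event) {
        if (event == null) {
            return false;
        }
        int eq = event.indexOf('=');
        return eq > 0 && eq < event.length() - 1;
    }

    /** Sends each event to exactly one of the two outputs. */
    public static Split route(List<String> events) {
        Split split = new Split();
        for (String event : events) {
            if (isValid(event)) {
                split.valid.add(event);
            } else {
                split.errors.add(event);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // Every event is valid, so the error output stays empty.
        Split allValid = route(Arrays.asList("user=alice", "user=bob"));
        System.out.println(allValid.valid.size() + " valid, " + allValid.errors.size() + " errors");

        // One malformed event ends up in the error output.
        Split mixed = route(Arrays.asList("user=alice", "garbage"));
        System.out.println(mixed.valid.size() + " valid, " + mixed.errors.size() + " errors");
    }
}
```

A sink attached to the error output therefore has to handle zero elements, since an all-valid input is a normal outcome rather than an edge case.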
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)