[
https://issues.apache.org/jira/browse/BEAM-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829884#comment-16829884
]
Valentyn Tymofieiev commented on BEAM-7173:
-------------------------------------------
I was curious to check whether this could be fixed with a one-liner:
https://github.com/apache/beam/pull/8428, but a bunch of BQ tests failed. I
suspect those tests need to pass a schema explicitly when a table is first
created.
The following failure in particular is concerning: it looks like
BigQueryStreamingInsertTransformIntegrationTests exercises the
BigQueryBatchFileLoads codepath. Do you know why that would be the case,
[~pabloem]? I thought BigQueryBatchFileLoads was disabled by default.
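As a sketch of the behavior change being discussed (the function name and signature below are hypothetical, not actual Beam API), the idea would be to fail fast when neither a schema nor an explicit autodetect opt-in is given, instead of silently enabling autodetection:

```python
# Hypothetical sketch of the proposed opt-in behavior for load jobs.
# Names are illustrative only; this is not the real bigquery_tools code.
def resolve_load_job_schema(user_schema, autodetect=False):
    """Return (schema, autodetect_flag) for a BigQuery load job."""
    if user_schema is not None:
        return user_schema, False  # an explicit schema always wins
    if autodetect:
        return None, True          # user explicitly opted in to autodetection
    # Neither a schema nor an opt-in: raise instead of implicitly
    # turning autodetection on (the current default being questioned here).
    raise ValueError('No schema specified and autodetect not enabled')
```

Under this sketch, the failing integration tests would have to pass either a schema or `autodetect=True` when the destination table does not exist yet.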
{noformat}
12:44:53 ======================================================================
12:44:53 ERROR: test_value_provider_transform (apache_beam.io.gcp.bigquery_test.BigQueryStreamingInsertTransformIntegrationTests)
12:44:53 ----------------------------------------------------------------------
12:44:53 Traceback (most recent call last):
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/bigquery_test.py", line 534, in test_value_provider_transform
12:44:53     method='FILE_LOADS'))
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 426, in __exit__
12:44:53     self.run().wait_until_finish()
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 406, in run
12:44:53     self._options).run(False)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 419, in run
12:44:53     return self.runner.run_pipeline(self, self._options)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py", line 64, in run_pipeline
12:44:53     self.result.wait_until_finish(duration=wait_duration)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py", line 1240, in wait_until_finish
12:44:53     (self.state, getattr(self._runner, 'last_error_msg', None)), self)
12:44:53 DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
12:44:53 Traceback (most recent call last):
12:44:53   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 649, in do_work
12:44:53     work_executor.execute()
12:44:53   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 176, in execute
12:44:53     op.start()
12:44:53   File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     def start(self):
12:44:53   File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     with self.scoped_start_state:
12:44:53   File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     with self.spec.source.reader() as reader:
12:44:53   File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     self.output(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 246, in apache_beam.runners.worker.operations.Operation.output
12:44:53     cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 142, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
12:44:53     self.consumer.process(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 560, in apache_beam.runners.worker.operations.DoOperation.process
12:44:53     with self.scoped_process_state:
12:44:53   File "apache_beam/runners/worker/operations.py", line 561, in apache_beam.runners.worker.operations.DoOperation.process
12:44:53     delayed_application = self.dofn_receiver.receive(o)
12:44:53   File "apache_beam/runners/common.py", line 747, in apache_beam.runners.common.DoFnRunner.receive
12:44:53     self.process(windowed_value)
12:44:53   File "apache_beam/runners/common.py", line 753, in apache_beam.runners.common.DoFnRunner.process
12:44:53     self._reraise_augmented(exn)
12:44:53   File "apache_beam/runners/common.py", line 807, in apache_beam.runners.common.DoFnRunner._reraise_augmented
12:44:53     raise_with_traceback(new_exn)
12:44:53   File "apache_beam/runners/common.py", line 751, in apache_beam.runners.common.DoFnRunner.process
12:44:53     return self.do_fn_invoker.invoke_process(windowed_value)
12:44:53   File "apache_beam/runners/common.py", line 563, in apache_beam.runners.common.PerWindowInvoker.invoke_process
12:44:53     self._invoke_process_per_window(
12:44:53   File "apache_beam/runners/common.py", line 635, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
12:44:53     windowed_value, self.process_method(*args_for_process))
12:44:53   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 432, in process
12:44:53     'BigQuery jobs failed. BQ error: %s', self._latest_error)
12:44:53 Exception: (u"BigQuery jobs failed. BQ error: %s [while running 'WriteWithMultipleDests2/BigQueryBatchFileLoads/WaitForLoadJobs/WaitForLoadJobs']", <JobStatus
12:44:53  errorResult: <ErrorProto
12:44:53  message: u'No schema specified on job or table.'
{noformat}
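The "No schema specified on job or table" error above ties into the bytes-vs-string limitation described in the issue: by the time BigQuery's autodetector sees load data, everything is text. A small pure-Python illustration of why (no Beam or BigQuery involved):

```python
import base64
import json

# In newline-delimited JSON load files, BYTES values are base64-encoded,
# so the data the schema autodetector examines is plain text; nothing
# distinguishes a binary payload from an ordinary STRING field.
raw = b'\x00\xffbinary'
encoded = base64.b64encode(raw).decode('ascii')  # what lands in the JSON file
row = json.loads(json.dumps({'payload': encoded}))

# The loaded value is an ordinary str; autodetection will infer STRING.
assert isinstance(row['payload'], str)
# Only an explicit 'payload:BYTES' schema entry tells BigQuery to
# base64-decode the value back into bytes on load.
assert base64.b64decode(row['payload']) == raw
```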
> BigQuery connector should not enable schema autodetection without the user
> explicitly instructing it to do so.
> --------------------------------------------------------------------------------------------
>
> Key: BEAM-7173
> URL: https://issues.apache.org/jira/browse/BEAM-7173
> Project: Beam
> Issue Type: Bug
> Components: io-python-gcp
> Reporter: Valentyn Tymofieiev
> Assignee: Pablo Estrada
> Priority: Major
>
> Currently the BQ_FILE_LOADS insertion method enables schema autodetection:
> [https://github.com/apache/beam/blob/6567f1687d53e491b337ba94f521fa2e4af35e46/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L340]
> It may be more user-friendly to let users opt in to schema autodetection in
> their pipelines across all use cases of the BQ connector. Schema
> autodetection is an approximation and does not always work.
> For example, schema autodetection cannot infer whether string data is
> binary bytes or textual, and will always prefer the latter. If schema
> autodetection is enabled by default, users who need to write 'bytes' data
> will always have to specify a schema, even when writing to a table that was
> already created and has a schema. Otherwise, the autodetected schema will try
> to write a 'string' entry into a 'bytes' field and the write will fail.
> Related discussion:
> [https://lists.apache.org/thread.html/1f9d9cb1bbbfca87d74e62ba8e58a15059ed6c20ab419002fcd3f8df@%3Cdev.beam.apache.org%3E]
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)