[ https://issues.apache.org/jira/browse/BEAM-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829884#comment-16829884 ]

Valentyn Tymofieiev commented on BEAM-7173:
-------------------------------------------

I was curious to check whether this can be fixed with a one-liner
(https://github.com/apache/beam/pull/8428), and a bunch of BQ tests failed. I
suspect the tests need to pass a schema explicitly when a table is first
created.
In particular, the following failure is concerning: it looks like
BigQueryStreamingInsertTransformIntegrationTests exercises the
BigQueryBatchFileLoads code path. Do you know why that would be the case,
[~pabloem]? I thought BigQueryBatchFileLoads is disabled by default.

{noformat}
12:44:53 ======================================================================
12:44:53 ERROR: test_value_provider_transform (apache_beam.io.gcp.bigquery_test.BigQueryStreamingInsertTransformIntegrationTests)
12:44:53 ----------------------------------------------------------------------
12:44:53 Traceback (most recent call last):
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/bigquery_test.py", line 534, in test_value_provider_transform
12:44:53     method='FILE_LOADS'))
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 426, in __exit__
12:44:53     self.run().wait_until_finish()
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 406, in run
12:44:53     self._options).run(False)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 419, in run
12:44:53     return self.runner.run_pipeline(self, self._options)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py", line 64, in run_pipeline
12:44:53     self.result.wait_until_finish(duration=wait_duration)
12:44:53   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py", line 1240, in wait_until_finish
12:44:53     (self.state, getattr(self._runner, 'last_error_msg', None)), self)
12:44:53 DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
12:44:53 Traceback (most recent call last):
12:44:53   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 649, in do_work
12:44:53     work_executor.execute()
12:44:53   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 176, in execute
12:44:53     op.start()
12:44:53   File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     def start(self):
12:44:53   File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     with self.scoped_start_state:
12:44:53   File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     with self.spec.source.reader() as reader:
12:44:53   File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
12:44:53     self.output(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 246, in apache_beam.runners.worker.operations.Operation.output
12:44:53     cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 142, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
12:44:53     self.consumer.process(windowed_value)
12:44:53   File "apache_beam/runners/worker/operations.py", line 560, in apache_beam.runners.worker.operations.DoOperation.process
12:44:53     with self.scoped_process_state:
12:44:53   File "apache_beam/runners/worker/operations.py", line 561, in apache_beam.runners.worker.operations.DoOperation.process
12:44:53     delayed_application = self.dofn_receiver.receive(o)
12:44:53   File "apache_beam/runners/common.py", line 747, in apache_beam.runners.common.DoFnRunner.receive
12:44:53     self.process(windowed_value)
12:44:53   File "apache_beam/runners/common.py", line 753, in apache_beam.runners.common.DoFnRunner.process
12:44:53     self._reraise_augmented(exn)
12:44:53   File "apache_beam/runners/common.py", line 807, in apache_beam.runners.common.DoFnRunner._reraise_augmented
12:44:53     raise_with_traceback(new_exn)
12:44:53   File "apache_beam/runners/common.py", line 751, in apache_beam.runners.common.DoFnRunner.process
12:44:53     return self.do_fn_invoker.invoke_process(windowed_value)
12:44:53   File "apache_beam/runners/common.py", line 563, in apache_beam.runners.common.PerWindowInvoker.invoke_process
12:44:53     self._invoke_process_per_window(
12:44:53   File "apache_beam/runners/common.py", line 635, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
12:44:53     windowed_value, self.process_method(*args_for_process))
12:44:53   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 432, in process
12:44:53     'BigQuery jobs failed. BQ error: %s', self._latest_error)
12:44:53 Exception: (u"BigQuery jobs failed. BQ error: %s [while running 'WriteWithMultipleDests2/BigQueryBatchFileLoads/WaitForLoadJobs/WaitForLoadJobs']", <JobStatus
12:44:53  errorResult: <ErrorProto
12:44:53  message: u'No schema specified on job or table.'
{noformat}
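If the fix is to stop forcing autodetection, the failing tests would presumably need to pass a schema explicitly when the destination table is first created. A minimal sketch of what that could look like (the table name and fields below are hypothetical; the compact 'name:TYPE,...' schema string is one of the forms WriteToBigQuery accepts):

```python
def bq_schema_string(fields):
    # Build the compact 'name:TYPE,name:TYPE' schema string that
    # WriteToBigQuery accepts for its schema= argument.
    return ','.join('%s:%s' % (name, typ) for name, typ in fields)

schema = bq_schema_string([('name', 'STRING'), ('language', 'STRING')])
# schema == 'name:STRING,language:STRING'

# In the integration test this would then be passed explicitly, e.g.:
#   beam.io.WriteToBigQuery(
#       'project:dataset.table',   # hypothetical destination
#       schema=schema,
#       method='FILE_LOADS')
```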


> BigQuery connector should not enable schema autodetection without the user 
> explicitly instructing it to do so. 
> --------------------------------------------------------------------------------------------
>
>                 Key: BEAM-7173
>                 URL: https://issues.apache.org/jira/browse/BEAM-7173
>             Project: Beam
>          Issue Type: Bug
>          Components: io-python-gcp
>            Reporter: Valentyn Tymofieiev
>            Assignee: Pablo Estrada
>            Priority: Major
>
> Currently the BQ_FILE_LOADS insertion method enables schema autodetection: 
> [https://github.com/apache/beam/blob/6567f1687d53e491b337ba94f521fa2e4af35e46/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L340]
>  It may be more user-friendly to let users opt in to schema autodetection 
> in their pipelines across all use cases of the BQ connector. Schema 
> autodetection is an approximation and does not always work.
> For example, schema autodetection cannot infer whether string data is 
> binary bytes or a textual string, and will always prefer the latter. If schema 
> autodetection is enabled by default, users who need to write 'bytes' data 
> will always have to specify a schema, even when writing to a table that 
> already exists and has a schema. Otherwise the autodetected schema will try to 
> write a 'string' entry into a 'bytes' field and the write will fail.
> Related discussion: 
> [https://lists.apache.org/thread.html/1f9d9cb1bbbfca87d74e62ba8e58a15059ed6c20ab419002fcd3f8df@%3Cdev.beam.apache.org%3E]
>  
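The bytes-vs-string ambiguity described in the issue is inherent to the wire format: BigQuery transmits BYTES values as base64-encoded text, so autodetection only ever sees a string. A small illustration (the 'data' field name is hypothetical):

```python
import base64

payload = b'\x00\x01\xff'  # genuine binary data
encoded = base64.b64encode(payload).decode('ascii')

# On the wire the value is just the text 'AAH/', which autodetection
# would classify as STRING; only an explicit 'data:BYTES' schema entry
# tells BigQuery to base64-decode it back into bytes.
```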



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)