[ 
https://issues.apache.org/jira/browse/BEAM-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844914#comment-16844914
 ] 

Juta Staes commented on BEAM-7173:
----------------------------------

I am working on writing integration tests (ITs) for BigQuery IO.
When testing schema autodetection I get the following error:
{code:java}
ERROR: test_big_query_write_schema_autodetect (apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/io/gcp/bigquery_write_it_test.py", line 156, in test_big_query_write_schema_autodetect
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY))
  File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/pipeline.py", line 419, in run
    return self.runner.run_pipeline(self, self._options)
  File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py", line 64, in run_pipeline
    self.result.wait_until_finish(duration=wait_duration)
  File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py", line 1322, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Workflow failed. Causes: S01:create/Read+write/WriteToBigQuery/NativeWrite failed., BigQuery import job "dataflow_job_18059625072014532771-B" failed., BigQuery job "dataflow_job_18059625072014532771-B" in project "apache-beam-testing" finished with error(s): errorResult: No schema specified on job or table., error: No schema specified on job or table.
{code}
Test code:
{code:java}
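import apache_beam as beam

# `args` and `output_table` are provided by the test setup (not shown here).
input_data = [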
    {'number': 1, 'str': 'abc'},
    {'number': 2, 'str': 'def'},
]

with beam.Pipeline(argv=args) as p:
  (p | 'create' >> beam.Create(input_data)
   | 'write' >> beam.io.WriteToBigQuery(
       output_table,
       schema=beam.io.gcp.bigquery.SCHEMA_AUTODETECT,
       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
       write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY))
{code}
Is there something wrong with my test or is this a bug?

Link to PR: [https://github.com/apache/beam/pull/8621]
cc: [~tvalentyn] [~pabloem]

> Bigquery connector should not enable schema autodetection without a user 
> explicitly instructing to do so. 
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-7173
>                 URL: https://issues.apache.org/jira/browse/BEAM-7173
>             Project: Beam
>          Issue Type: Bug
>          Components: io-python-gcp
>            Reporter: Valentyn Tymofieiev
>            Assignee: Pablo Estrada
>            Priority: Major
>             Fix For: 2.13.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Currently the BQ_FILE_LOADS insertion method enables schema autodetection: 
> [https://github.com/apache/beam/blob/6567f1687d53e491b337ba94f521fa2e4af35e46/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L340]
> It may be more user-friendly to let users opt in to schema autodetection 
> in their pipelines across all use cases for the BQ connector. Schema 
> autodetection is an approximation and does not always work.
> For example, schema autodetection cannot infer whether string data is 
> binary bytes or a textual string, and will always prefer the latter. If schema 
> autodetection is enabled by default, users who need to write 'bytes' data 
> will always have to specify a schema, even when writing to a table that 
> already exists and has a schema. Otherwise the autodetected schema will try to 
> write a 'string' entry into a 'bytes' field and the write will fail.
> Related discussion: 
> [https://lists.apache.org/thread.html/1f9d9cb1bbbfca87d74e62ba8e58a15059ed6c20ab419002fcd3f8df@%3Cdev.beam.apache.org%3E]
>  
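The bytes-versus-string ambiguity described in the issue can be illustrated with a small standalone sketch. It assumes rows are serialized to JSON before loading, where binary values have to be represented as base64 text. The helpers `to_json_row` and `naive_autodetect`, and the `payload` field, are hypothetical names made up for this illustration; the inference shown is a deliberately simplified stand-in, not BigQuery's actual autodetection algorithm.

```python
import base64
import json


def to_json_row(row):
    """Serialize a row to JSON; bytes values become base64-encoded text."""
    encoded = {
        key: base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes) else value
        for key, value in row.items()
    }
    return json.dumps(encoded, sort_keys=True)


def naive_autodetect(json_row):
    """Hypothetical, simplified type inference over a single JSON row.

    This is NOT BigQuery's real algorithm; it only shows that once bytes
    are rendered as text, the serialized row no longer says so.
    """
    type_names = {str: "STRING", int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}
    return {
        key: type_names[type(value)]
        for key, value in json.loads(json_row).items()
    }


row = {"number": 1, "payload": b"\x00\x01binary"}
json_row = to_json_row(row)

# The binary field is indistinguishable from ordinary text after serialization.
detected = naive_autodetect(json_row)
print(detected)  # {'number': 'INTEGER', 'payload': 'STRING'}

# An explicit schema string (the "name:TYPE,name:TYPE" form) removes the guesswork.
explicit_schema = "number:INTEGER,payload:BYTES"
print(explicit_schema)
```

In this sketch only the explicit schema records that `payload` is binary, which is why a user writing 'bytes' data cannot rely on an inferred schema.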



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
