[ https://issues.apache.org/jira/browse/BEAM-7173?focusedWorklogId=237892&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-237892 ]
ASF GitHub Bot logged work on BEAM-7173:
----------------------------------------
Author: ASF GitHub Bot
Created on: 06/May/19 16:37
Start Date: 06/May/19 16:37
Worklog Time Spent: 10m
Work Description: tvalentyn commented on pull request #8473: [BEAM-7173]
Avoiding schema autodetection by default in WriteToBigQuery
URL: https://github.com/apache/beam/pull/8473#discussion_r281259841
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -666,7 +666,12 @@ def _create_table_if_needed(self, table_reference, schema=None):
     logging.debug('Creating or getting table %s with schema %s.',
                   table_reference, schema)
-    table_schema = self.get_table_schema(schema)
+    if schema == SCHEMA_AUTODETECT:
Review comment:
Are we assuming that this helper gets called only in the streaming inserts
codepath?
Perhaps we should strengthen the condition to something like:
`if schema == SCHEMA_AUTODETECT and method == STREAMING_INSERTS`,
or move this check somewhere up the call stack to fail early?
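
A minimal sketch of the "fail early" variant suggested above, assuming the intent of the new branch is to reject schema autodetection when rows are written with streaming inserts. The function name `validate_schema_for_method` and the string values of the constants are hypothetical stand-ins, not the actual Beam definitions:

```python
# Hypothetical stand-ins for the constants named in the review comment.
# The real values in apache_beam may differ.
SCHEMA_AUTODETECT = 'SCHEMA_AUTODETECT'
STREAMING_INSERTS = 'STREAMING_INSERTS'
FILE_LOADS = 'FILE_LOADS'


def validate_schema_for_method(schema, method):
    """Reject unsupported schema/method combinations before any table work.

    Assumption: autodetection is only meaningful for load jobs, so asking
    for it together with streaming inserts is treated as a user error.
    """
    if schema == SCHEMA_AUTODETECT and method == STREAMING_INSERTS:
        raise ValueError(
            'Schema autodetection was requested together with streaming '
            'inserts; provide an explicit schema or use file loads.')
    return schema


# File loads with autodetection pass through unchanged.
validate_schema_for_method(SCHEMA_AUTODETECT, FILE_LOADS)

# Streaming inserts with autodetection fail at validation time.
try:
    validate_schema_for_method(SCHEMA_AUTODETECT, STREAMING_INSERTS)
except ValueError as err:
    print('rejected:', err)
```

Calling a check like this where the transform is constructed would surface the error before any table is created, which is the behavior the comment asks about.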
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 237892)
Time Spent: 50m (was: 40m)
> Bigquery connector should not enable schema autodetection without a user
> explicitly instructing to do so.
> ----------------------------------------------------------------------------------------------------------
>
> Key: BEAM-7173
> URL: https://issues.apache.org/jira/browse/BEAM-7173
> Project: Beam
> Issue Type: Bug
> Components: io-python-gcp
> Reporter: Valentyn Tymofieiev
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Currently the BQ_FILE_LOADS insertion method enables schema autodetection:
> [https://github.com/apache/beam/blob/6567f1687d53e491b337ba94f521fa2e4af35e46/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L340]
> It may be more user-friendly to let users opt in to schema autodetection
> in their pipelines across all use cases for the BQ connector. Schema
> autodetection is an approximation and does not always work.
> For example, schema autodetection cannot infer whether string data is
> binary bytes or a textual string, and will always prefer the latter. If schema
> autodetection is enabled by default, users who need to write 'bytes' data
> will always have to specify a schema, even when writing to a table that was
> already created and has a schema. Otherwise the autodetected schema will try to
> write a 'string' entry into a 'bytes' field and the write will fail.
> Related discussion:
> [https://lists.apache.org/thread.html/1f9d9cb1bbbfca87d74e62ba8e58a15059ed6c20ab419002fcd3f8df@%3Cdev.beam.apache.org%3E]
>
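
The bytes-versus-string ambiguity described in the issue can be illustrated with a toy example. This is only a sketch: `naive_infer_type` is a hypothetical helper written for illustration and is not how BigQuery or Beam detect schemas. It assumes rows are serialized as JSON, where binary data has to be carried as a base64-encoded string:

```python
import base64
import json


def naive_infer_type(value):
    """Toy type inference over a decoded JSON value (illustration only)."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    if isinstance(value, str):
        return 'STRING'
    raise TypeError('unsupported value: %r' % (value,))


# 'payload' holds binary data, base64-encoded so that it survives JSON.
row = {
    'name': 'abc',
    'payload': base64.b64encode(b'\x00\x01\x02').decode('ascii'),
}

# After a JSON round trip the two fields are indistinguishable by type.
decoded = json.loads(json.dumps(row))
inferred = {key: naive_infer_type(value) for key, value in decoded.items()}
```

Both fields come out as `STRING`, so the only way to learn that `payload` is meant to be 'bytes' is an explicit schema or the schema of the existing table.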