[ 
https://issues.apache.org/jira/browse/BEAM-8841?focusedWorklogId=393645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-393645
 ]

ASF GitHub Bot logged work on BEAM-8841:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Feb/20 17:38
            Start Date: 26/Feb/20 17:38
    Worklog Time Spent: 10m 
      Work Description: chunyang commented on pull request #10979: [BEAM-8841] 
Support writing data to BigQuery via Avro in Python SDK
URL: https://github.com/apache/beam/pull/10979#discussion_r384653061
 
 

 ##########
 File path: sdks/python/apache_beam/io/gcp/bigquery.py
 ##########
 @@ -1361,87 +1369,18 @@ def __init__(
     self.triggering_frequency = triggering_frequency
     self.insert_retry_strategy = insert_retry_strategy
     self._validate = validate
+    self._temp_file_format = temp_file_format or bigquery_tools.FileFormat.JSON
 
     self.additional_bq_parameters = additional_bq_parameters or {}
     self.table_side_inputs = table_side_inputs or ()
     self.schema_side_inputs = schema_side_inputs or ()
 
-  @staticmethod
-  def get_table_schema_from_string(schema):
-    """Transform the string table schema into a
-    :class:`~apache_beam.io.gcp.internal.clients.bigquery.\
-bigquery_v2_messages.TableSchema` instance.
-
-    Args:
-      schema (str): The sting schema to be used if the BigQuery table to write
-        has to be created.
-
-    Returns:
-      ~apache_beam.io.gcp.internal.clients.bigquery.\
-bigquery_v2_messages.TableSchema:
-      The schema to be used if the BigQuery table to write has to be created
-      but in the :class:`~apache_beam.io.gcp.internal.clients.bigquery.\
-bigquery_v2_messages.TableSchema` format.
-    """
-    table_schema = bigquery.TableSchema()
-    schema_list = [s.strip() for s in schema.split(',')]
-    for field_and_type in schema_list:
-      field_name, field_type = field_and_type.split(':')
-      field_schema = bigquery.TableFieldSchema()
-      field_schema.name = field_name
-      field_schema.type = field_type
-      field_schema.mode = 'NULLABLE'
-      table_schema.fields.append(field_schema)
-    return table_schema
-
-  @staticmethod
-  def table_schema_to_dict(table_schema):
-    """Create a dictionary representation of table schema for serialization
-    """
-    def get_table_field(field):
-      """Create a dictionary representation of a table field
-      """
-      result = {}
-      result['name'] = field.name
-      result['type'] = field.type
-      result['mode'] = getattr(field, 'mode', 'NULLABLE')
-      if hasattr(field, 'description') and field.description is not None:
-        result['description'] = field.description
-      if hasattr(field, 'fields') and field.fields:
-        result['fields'] = [get_table_field(f) for f in field.fields]
-      return result
-
-    if not isinstance(table_schema, bigquery.TableSchema):
-      raise ValueError("Table schema must be of the type bigquery.TableSchema")
-    schema = {'fields': []}
-    for field in table_schema.fields:
-      schema['fields'].append(get_table_field(field))
-    return schema
-
-  @staticmethod
-  def get_dict_table_schema(schema):
-    """Transform the table schema into a dictionary instance.
-
-    Args:
-      schema (~apache_beam.io.gcp.internal.clients.bigquery.\
-bigquery_v2_messages.TableSchema):
-        The schema to be used if the BigQuery table to write has to be created.
-        This can either be a dict or string or in the TableSchema format.
-
-    Returns:
-      Dict[str, Any]: The schema to be used if the BigQuery table to write has
-      to be created but in the dictionary format.
-    """
-    if (isinstance(schema, (dict, vp.ValueProvider)) or callable(schema) or
-        schema is None):
-      return schema
-    elif isinstance(schema, (str, unicode)):
-      table_schema = WriteToBigQuery.get_table_schema_from_string(schema)
-      return WriteToBigQuery.table_schema_to_dict(table_schema)
-    elif isinstance(schema, bigquery.TableSchema):
-      return WriteToBigQuery.table_schema_to_dict(schema)
-    else:
-      raise TypeError('Unexpected schema argument: %s.' % schema)
+  # Dict/schema methods were moved to bigquery_tools, but keep references
+  # here for backward compatibility.
+  get_table_schema_from_string = \
+      staticmethod(bigquery_tools.get_table_schema_from_string)
+  table_schema_to_dict = staticmethod(bigquery_tools.table_schema_to_dict)
+  get_dict_table_schema = staticmethod(bigquery_tools.get_dict_table_schema)
 
 Review comment:
   Moved these to avoid a cyclic import.
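The change above keeps the old `WriteToBigQuery.<method>` entry points alive as staticmethod aliases of module-level functions. A minimal, self-contained sketch of that backward-compatibility pattern (the parser below is a simplified stand-in for Beam's `bigquery_tools` helpers, not the actual API):

```python
# Stand-in for a helper moved to a utility module (e.g. bigquery_tools)
# to break a cyclic import between the two modules.
def get_table_schema_from_string(schema):
    """Parse a 'name:TYPE,name:TYPE' string into a schema dict."""
    fields = []
    for part in schema.split(','):
        name, ftype = part.strip().split(':')
        fields.append({'name': name, 'type': ftype, 'mode': 'NULLABLE'})
    return {'fields': fields}


class WriteToBigQuery:
    # Re-expose the moved function as a staticmethod so existing callers
    # of WriteToBigQuery.get_table_schema_from_string keep working.
    get_table_schema_from_string = staticmethod(get_table_schema_from_string)


schema = WriteToBigQuery.get_table_schema_from_string('id:INTEGER,name:STRING')
```

Because the alias is assigned at class-definition time, callers see no behavioral difference; only the function's home module changes.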
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 393645)
    Time Spent: 40m  (was: 0.5h)

> Add ability to perform BigQuery file loads using avro
> -----------------------------------------------------
>
>                 Key: BEAM-8841
>                 URL: https://issues.apache.org/jira/browse/BEAM-8841
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-py-gcp
>            Reporter: Chun Yang
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, JSON format is used for file loads into BigQuery in the Python 
> SDK. JSON has some disadvantages, including the size of serialized data and 
> the inability to represent NaN and Infinity float values.
> BigQuery supports loading files in Avro format, which can overcome these 
> disadvantages. The Java SDK already supports loading files in Avro format 
> (BEAM-2879), so it makes sense to support it in the Python SDK as well.
> The change will be somewhere around 
> [{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].
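
The NaN/Infinity limitation described above can be demonstrated with the standard library alone (a sketch, not Beam code; `struct` is used here only as a stand-in for Avro's binary encoding of a double, which per the Avro spec is little-endian IEEE 754):

```python
import json
import math
import struct

# JSON cannot represent NaN/Infinity: strict serialization fails outright,
# and the lenient default emits non-standard tokens.
row = {'value': float('nan')}
try:
    json.dumps(row, allow_nan=False)
except ValueError:
    print('strict JSON rejects NaN')

print(json.dumps(row))  # emits '{"value": NaN}', which is not valid JSON

# Avro encodes a double as 8 raw little-endian IEEE 754 bytes, so NaN and
# Infinity round-trip losslessly through the file format.
packed = struct.pack('<d', float('inf'))
assert math.isinf(struct.unpack('<d', packed)[0])
```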



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
