TheNeuralBit commented on code in PR #22616:
URL: https://github.com/apache/beam/pull/22616#discussion_r956055366
##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -756,3 +784,59 @@ def expand(self, pcoll):
| beam.Map(lambda file_result:
file_result.file_name).with_output_types(
str)
}
+
+
+class _ReadGbq(beam.PTransform):
+ """Read data from BigQuery with output type 'BEAM_ROW',
+ then convert it into a deferred dataframe.
+
+ This PTransform wraps the Python ReadFromBigQuery PTransform,
+ and sets the output_type as 'BEAM_ROW' to convert
+ into a Beam Schema. Once applied to a pipeline object,
+ it is passed into the to_dataframe() function to convert the
+ PCollection into a deferred dataframe.
+
+ This PTransform currently does not support queries.
+
+ Args:
+ table (str): The ID of the table. The ID must contain only
+ letters ``a-z``, ``A-Z``,
+ numbers ``0-9``, underscores ``_`` or white spaces.
+ Note that the table argument must contain the entire table
+ reference specified as: ``'PROJECT:DATASET.TABLE'``.
+ use_bq_storage_api (bool): The method to use to read from BigQuery.
+ It may be 'EXPORT' or
+ 'DIRECT_READ'. EXPORT invokes a BigQuery export request
+ (https://cloud.google.com/bigquery/docs/exporting-data).
+ 'DIRECT_READ' reads
+ directly from BigQuery storage using the BigQuery Read API
+ (https://cloud.google.com/bigquery/docs/reference/storage). If
+ unspecified or set to false, the default is currently utilized (EXPORT).
+ If the flag is set to true,
+ 'DIRECT_READ' will be utilized."""
Review Comment:
nit: I think the second line should be indented here
```suggestion
table (str): The ID of the table. The ID must contain only
letters ``a-z``, ``A-Z``,
numbers ``0-9``, underscores ``_`` or white spaces.
Note that the table argument must contain the entire table
reference specified as: ``'PROJECT:DATASET.TABLE'``.
use_bq_storage_api (bool): The method to use to read from BigQuery.
It may be 'EXPORT' or
'DIRECT_READ'. EXPORT invokes a BigQuery export request
(https://cloud.google.com/bigquery/docs/exporting-data).
'DIRECT_READ' reads
directly from BigQuery storage using the BigQuery Read API
(https://cloud.google.com/bigquery/docs/reference/storage). If
unspecified or set to false, the default is currently utilized
(EXPORT).
If the flag is set to true,
'DIRECT_READ' will be utilized."""
```
##########
sdks/python/apache_beam/dataframe/io_it_test.py:
##########
@@ -0,0 +1,123 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
Review Comment:
```suggestion
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
```
I think this will make the RAT check happy. To confirm, you could run
`./gradlew rat` locally and inspect the report at `build/reports/rat/index.html`.
##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -58,6 +58,28 @@
_DEFAULT_BYTES_CHUNKSIZE = 1 << 20
+def read_gbq(
+ table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
+ """This function reads data from a BigQuery source and outputs it into
+ a Beam deferred dataframe
+ (https://beam.apache.org/documentation/dsls/dataframes/overview/)
+ Please specify a table in the format 'PROJECT:dataset.table'
+ or use the table, dataset, and project_id args
+ to specify the table. If you would like to utilize the BigQuery
+ Storage API in ReadFromBigQuery,
+ please set use_bq_storage_api to True.
+ Otherwise, please set the flag to false or
+ leave it unspecified."""
+ if table is None:
+ raise ValueError("Please specify a BigQuery table to read from.")
+ elif len(kwargs) > 0:
+ raise ValueError(
+ "Unsupported parameter entered in read_gbq. Please enter only "
+ "supported parameters 'table', 'dataset', "
+ "'project_id', 'use_bqstorage_api'.")
Review Comment:
What about using `kwargs.keys()` as suggested previously?
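Concretely, the validation could name the offending keys so the error is actionable. This is only a sketch of the validation portion (the actual read logic is elided, and the message wording is an assumption, not the PR's):

```python
def read_gbq(table, dataset=None, project_id=None,
             use_bqstorage_api=False, **kwargs):
    # Sketch of the argument validation only.
    if table is None:
        raise ValueError("Please specify a BigQuery table to read from.")
    if kwargs:
        # Listing kwargs.keys() tells the caller exactly which
        # parameters were rejected.
        raise ValueError(
            "Unsupported parameter(s) passed to read_gbq: %s. Only "
            "'table', 'dataset', 'project_id' and 'use_bqstorage_api' "
            "are supported." % ", ".join(sorted(kwargs.keys())))
```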