TheNeuralBit commented on code in PR #22616:
URL: https://github.com/apache/beam/pull/22616#discussion_r944932683


##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""

Review Comment:
   nit:
   ```suggestion
   """Integration tests for Dataframe sources and sinks."""
   ```



##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -58,6 +58,28 @@
 _DEFAULT_BYTES_CHUNKSIZE = 1 << 20
 
 
+def read_gbq(
+    table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
+  """This function reads data from a BigQuery source and outputs it into
+  a Beam deferred dataframe
+  (https://beam.apache.org/documentation/dsls/dataframes/overview/)
+  Please specify a table in the format 'PROJECT:dataset.table'
+  or use the table, dataset, and project_id args
+  to specify the table. If you would like to utilize the BigQuery
+  Storage API in ReadFromBigQuery,
+    please set use_bq_storage_api to True.
+    Otherwise, please set the flag to false or
+    leave it unspecified."""
+  if table is None:
+    raise ValueError("Please specify a BigQuery table to read from.")
+  elif len(kwargs) > 0:
+    raise ValueError(
+        "Unsupported parameter entered in read_gbq. Please enter only "
+        "supported parameters 'table', 'dataset', "
+        "'project_id', 'use_bqstorage_api'.")

Review Comment:
   nit: maybe print the contents of kwargs.keys() instead, to identify the specific unsupported parameters?
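A hedged sketch of what that validation might look like (the signature follows the PR; the exact error wording is illustrative, not what the PR ships):

```python
def read_gbq(
    table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
  """Sketch of the argument validation only; the real function goes on
  to build the Beam read."""
  if table is None:
    raise ValueError("Please specify a BigQuery table to read from.")
  if kwargs:
    # Name the offending keys so the user knows exactly what to remove.
    raise ValueError(
        "Unsupported parameter(s) passed to read_gbq: "
        f"{sorted(kwargs.keys())}. Supported parameters are 'table', "
        "'dataset', 'project_id', and 'use_bqstorage_api'.")
```

With this, a call like `read_gbq('some_table', bogus=1)` would name `bogus` in the error rather than leaving the user to diff their call against the docstring.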



##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""
+# pytype: skip-file
+
+import logging
+import unittest
+
+import pytest
+
+import apache_beam as beam
+import apache_beam.io.gcp.bigquery
+from apache_beam.io.gcp import bigquery_read_it_test
+from apache_beam.io.gcp import bigquery_schema_tools
+from apache_beam.io.gcp import bigquery_tools
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+# Protect against environments where bigquery library is not available.
+# pylint: disable=wrong-import-order, wrong-import-position
+# pylint: enable=wrong-import-order, wrong-import-position
+
+_LOGGER = logging.getLogger(__name__)
+
+
+class ReadUsingReadGbqTests(bigquery_read_it_test.BigQueryReadIntegrationTests):
+  @pytest.mark.it_postcommit
+  def test_ReadGbq(self):
+    from apache_beam.dataframe import convert
+    the_table = bigquery_tools.BigQueryWrapper().get_table(
+        project_id="apache-beam-testing",
+        dataset_id="beam_bigquery_io_test",
+        table_id="dfsqltable_3c7d6fd5_16e0460dfd0")
+    table = the_table.schema
+    utype = bigquery_schema_tools. \
+        generate_user_type_from_bq_schema(table)
+    with beam.Pipeline(argv=self.args) as p:
+      actual_df = p | apache_beam.dataframe.io.read_gbq(
+          table="apache-beam-testing:beam_bigquery_io_test."
+          "dfsqltable_3c7d6fd5_16e0460dfd0",
+          use_bqstorage_api=False)
+      assert_that(
+          convert.to_pcollection(actual_df),
+          equal_to([
+              utype(id=3, name='customer1', type='test'),
+              utype(id=1, name='customer1', type='test'),
+              utype(id=2, name='customer2', type='test'),
+              utype(id=4, name='customer2', type='test')
+          ]))

Review Comment:
   I think you should be able to just compare the result to tuple instances, then you wouldn't need to generate the utype above. Will this work?
   ```suggestion
             equal_to([
                 (3, 'customer1', 'test'),
                 (1, 'customer1', 'test'),
                 (2, 'customer2', 'test'),
                 (4, 'customer2', 'test')
             ]))
   ```
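For what it's worth, the suggestion should hold whenever the generated user type is `NamedTuple`-based, since named tuples inherit plain tuple equality. A quick self-contained check, with `Row` as a hypothetical stand-in for the type `generate_user_type_from_bq_schema` produces:

```python
from typing import NamedTuple


class Row(NamedTuple):
  id: int
  name: str
  type: str


# NamedTuple instances compare equal to bare tuples field-by-field,
# which is why equal_to with plain tuples can replace the utype rows.
assert Row(id=3, name='customer1', type='test') == (3, 'customer1', 'test')
assert Row(id=3, name='customer1', type='test') != (1, 'customer1', 'test')
```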



##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""
+# pytype: skip-file
+
+import logging
+import unittest
+
+import pytest
+
+import apache_beam as beam
+import apache_beam.io.gcp.bigquery
+from apache_beam.io.gcp import bigquery_read_it_test
+from apache_beam.io.gcp import bigquery_schema_tools
+from apache_beam.io.gcp import bigquery_tools
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+# Protect against environments where bigquery library is not available.
+# pylint: disable=wrong-import-order, wrong-import-position
+# pylint: enable=wrong-import-order, wrong-import-position
+
+_LOGGER = logging.getLogger(__name__)
+
+
+class ReadUsingReadGbqTests(bigquery_read_it_test.BigQueryReadIntegrationTests):
+  @pytest.mark.it_postcommit
+  def test_ReadGbq(self):
+    from apache_beam.dataframe import convert
+    the_table = bigquery_tools.BigQueryWrapper().get_table(
+        project_id="apache-beam-testing",
+        dataset_id="beam_bigquery_io_test",
+        table_id="dfsqltable_3c7d6fd5_16e0460dfd0")
+    table = the_table.schema
+    utype = bigquery_schema_tools. \
+        generate_user_type_from_bq_schema(table)
+    with beam.Pipeline(argv=self.args) as p:
+      actual_df = p | apache_beam.dataframe.io.read_gbq(
+          table="apache-beam-testing:beam_bigquery_io_test."
+          "dfsqltable_3c7d6fd5_16e0460dfd0",
+          use_bqstorage_api=False)
+      assert_that(
+          convert.to_pcollection(actual_df),
+          equal_to([
+              utype(id=3, name='customer1', type='test'),
+              utype(id=1, name='customer1', type='test'),
+              utype(id=2, name='customer2', type='test'),
+              utype(id=4, name='customer2', type='test')
+          ]))
+
+  def test_ReadGbq_export_with_project(self):
+    from apache_beam.dataframe import convert
+    the_table = bigquery_tools.BigQueryWrapper().get_table(
+        project_id="apache-beam-testing",
+        dataset_id="beam_bigquery_io_test",
+        table_id="dfsqltable_3c7d6fd5_16e0460dfd0")
+    table = the_table.schema
+    utype = bigquery_schema_tools. \
+        generate_user_type_from_bq_schema(table)
+    with beam.Pipeline(argv=self.args) as p:
+      actual_df = p | apache_beam.dataframe.io.read_gbq(
+          table="dfsqltable_3c7d6fd5_16e0460dfd0",
+          dataset="beam_bigquery_io_test",
+          project_id="apache-beam-testing",
+          use_bqstorage_api=False)
+      assert_that(
+          convert.to_pcollection(actual_df),
+          equal_to([
+              utype(id=3, name='customer1', type='test'),
+              utype(id=1, name='customer1', type='test'),
+              utype(id=2, name='customer2', type='test'),
+              utype(id=4, name='customer2', type='test')
+          ]))
+
+  @pytest.mark.it_postcommit
+  def test_ReadGbq_direct_read(self):
+    from apache_beam.dataframe import convert
+    the_table = bigquery_tools.BigQueryWrapper().get_table(
+        project_id="apache-beam-testing",
+        dataset_id="beam_bigquery_io_test",
+        table_id="dfsqltable_3c7d6fd5_16e0460dfd0")
+    table = the_table.schema
+    utype = bigquery_schema_tools. \
+          generate_user_type_from_bq_schema(table)
+    with beam.Pipeline(argv=self.args) as p:
+      actual_df = p | apache_beam.dataframe.io.\
+          read_gbq(
+          table=
+          "apache-beam-testing:beam_bigquery_io_test."
+          "dfsqltable_3c7d6fd5_16e0460dfd0",
+          use_bqstorage_api=True)
+      assert_that(
+          convert.to_pcollection(actual_df),
+          equal_to([
+              utype(id=3, name='customer1', type='test'),
+              utype(id=1, name='customer1', type='test'),
+              utype(id=2, name='customer2', type='test'),
+              utype(id=4, name='customer2', type='test')
+          ]))
+
+  @pytest.mark.it_postcommit
+  def test_ReadGbq_direct_read_with_project(self):
+    from apache_beam.dataframe import convert
+    the_table = bigquery_tools.BigQueryWrapper().get_table(
+        project_id="apache-beam-testing",
+        dataset_id="beam_bigquery_io_test",
+        table_id="dfsqltable_3c7d6fd5_16e0460dfd0")
+    table = the_table.schema
+    utype = bigquery_schema_tools. \
+        generate_user_type_from_bq_schema(table)
+    with beam.Pipeline(argv=self.args) as p:
+      actual_df = p | apache_beam.dataframe.io.read_gbq(
+          table="dfsqltable_3c7d6fd5_16e0460dfd0",
+          dataset="beam_bigquery_io_test",
+          project_id="apache-beam-testing",
+          use_bqstorage_api=True)
+      assert_that(
+          convert.to_pcollection(actual_df),
+          equal_to([
+              utype(id=3, name='customer1', type='test'),
+              utype(id=1, name='customer1', type='test'),
+              utype(id=2, name='customer2', type='test'),
+              utype(id=4, name='customer2', type='test')
+          ]))

Review Comment:
   It could be good to add one more test that actually applies some computation on the dataframe, just to test it end-to-end, e.g.:
   
   ```
   beam_df = read_gbq(...)
   actual_df = beam_df.groupby('name').count()
   ```
   
   (Note in that case the 'name' column will be in the index, so you'll have to use `to_pcollection(actual_df, include_indexes=True)`)
   
   Alternatively we could build a nice example pipeline that uses read_gbq with a computation and put it in `apache_beam.examples.dataframe`.
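The index caveat is visible with plain pandas on the same four rows (assuming pandas is available; the deferred dataframe mirrors pandas semantics here):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [3, 1, 2, 4],
    'name': ['customer1', 'customer1', 'customer2', 'customer2'],
    'type': ['test', 'test', 'test', 'test'],
})

counts = df.groupby('name').count()
# groupby moves 'name' into the index, so it is no longer a column --
# hence the need for to_pcollection(..., include_indexes=True).
assert 'name' not in counts.columns
assert list(counts.index) == ['customer1', 'customer2']
assert counts.loc['customer1', 'id'] == 2
```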



##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -58,6 +58,28 @@
 _DEFAULT_BYTES_CHUNKSIZE = 1 << 20
 
 
+def read_gbq(
+    table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
+  """This function reads data from a BigQuery source and outputs it into
+  a Beam deferred dataframe
+  (https://beam.apache.org/documentation/dsls/dataframes/overview/)
+  Please specify a table in the format 'PROJECT:dataset.table'
+  or use the table, dataset, and project_id args
+  to specify the table. If you would like to utilize the BigQuery
+  Storage API in ReadFromBigQuery,
+    please set use_bq_storage_api to True.
+    Otherwise, please set the flag to false or
+    leave it unspecified."""

Review Comment:
   nit: Could you format this with "Args:" instead?
   ```suggestion
     Args:
       table: ...
       dataset: ...
       project_id: ...
       use_bqstorage_api: ...
   ```
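One possible filled-in version, sketched from the signature alone (the descriptions are my reading of the PR, not authoritative):

```python
def read_gbq(
    table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
  """Reads a BigQuery table into a deferred Beam dataframe.

  Args:
    table: The table to read, either as 'PROJECT:dataset.table' or as
      just the table name when ``dataset`` and ``project_id`` are given.
    dataset: The BigQuery dataset containing ``table`` (optional if the
      fully-qualified form is used).
    project_id: The GCP project that owns the dataset (optional if the
      fully-qualified form is used).
    use_bqstorage_api: If True, read via the BigQuery Storage API in
      ReadFromBigQuery; defaults to False.
  """
```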



##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""
+# pytype: skip-file
+
+import logging
+import unittest
+
+import pytest
+
+import apache_beam as beam
+import apache_beam.io.gcp.bigquery
+from apache_beam.io.gcp import bigquery_read_it_test
+from apache_beam.io.gcp import bigquery_schema_tools
+from apache_beam.io.gcp import bigquery_tools
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+# Protect against environments where bigquery library is not available.
+# pylint: disable=wrong-import-order, wrong-import-position
+# pylint: enable=wrong-import-order, wrong-import-position
+
+_LOGGER = logging.getLogger(__name__)
+
+
+class ReadUsingReadGbqTests(bigquery_read_it_test.BigQueryReadIntegrationTests):

Review Comment:
   Why is this inheriting from `BigQueryReadIntegrationTests`?



##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""
+# pytype: skip-file
+
+import logging
+import unittest
+
+import pytest
+
+import apache_beam as beam
+import apache_beam.io.gcp.bigquery
+from apache_beam.io.gcp import bigquery_read_it_test
+from apache_beam.io.gcp import bigquery_schema_tools
+from apache_beam.io.gcp import bigquery_tools
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+# Protect against environments where bigquery library is not available.
+# pylint: disable=wrong-import-order, wrong-import-position
+# pylint: enable=wrong-import-order, wrong-import-position

Review Comment:
   These comments look out of place



##########
sdks/python/apache_beam/dataframe/io_test.py:
##########
@@ -410,5 +419,50 @@ def test_double_write(self):
                           set(self.read_all_lines(output + 'out2.csv*')))
 
 
[email protected](HttpError is None, 'GCP dependencies are not installed')

Review Comment:
   Could you do an experiment to see what the error looks like if you try to use `read_gbq` without gcp dependencies installed? It should be sufficient to just make a new virtualenv, and install beam with `python -m pip install -e '.[test,dataframe]'` (note there's no gcp extra).
   
   I'd like to make sure that the user gets a helpful error directing them to install GCP deps.
   
   Even better would be to add a test that confirms this; it could be skipped unless GCP deps are _not_ installed.
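The inverted skip could look roughly like this (the probe import mirrors the one the existing skip uses; the pass body is a placeholder where the real test would call `apache_beam.dataframe.io.read_gbq` and assert on the error text):

```python
import unittest

try:
  from apitools.base.py.exceptions import HttpError
except ImportError:
  HttpError = None


# Note the inversion: this test only runs when GCP deps are ABSENT.
@unittest.skipIf(HttpError is not None, 'GCP dependencies are installed')
class ReadGbqWithoutGcpDepsTest(unittest.TestCase):
  def test_error_mentions_gcp_extra(self):
    # Sketch only: the real assertion would invoke read_gbq and check
    # that the resulting ImportError tells the user to install the
    # gcp extra (e.g. apache-beam[gcp]).
    pass
```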



##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -58,6 +58,28 @@
 _DEFAULT_BYTES_CHUNKSIZE = 1 << 20
 
 
+def read_gbq(
+    table, dataset=None, project_id=None, use_bqstorage_api=False, **kwargs):
+  """This function reads data from a BigQuery source and outputs it into
+  a Beam deferred dataframe
+  (https://beam.apache.org/documentation/dsls/dataframes/overview/)

Review Comment:
   ```suggestion
     """This function reads data from a BigQuery table and produces a 
:class:`~apache_beam.dataframe.frames.DeferredDataFrame`.
   ```
   
   (this will make a link in the generated documentation, like you see [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#module-apache_beam.dataframe.io))



##########
sdks/python/apache_beam/io/gcp/dataframe_io_it_test.py:
##########
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Unit tests for Dataframe sources and sinks."""

Review Comment:
   Another nit: I'd prefer we put this in `apache_beam.dataframe.io_it_test`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
