[ https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818 ]

Brian Hulette edited comment on BEAM-11587 at 3/2/22, 12:20 AM:
----------------------------------------------------------------

Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually 
using the pandas read_gbq and to_gbq (in the same way that we use read_csv and 
to_csv) is more trouble than it's worth: the files read by read_csv (and the 
other read_* functions) are relatively easy to split into partitions for 
reading from distributed worker nodes, but splitting up a BigQuery read is 
more complicated, and we'd need to implement a lot of that splitting logic 
ourselves.

The traditional BigQueryIO already has all of this splitting logic. The gap is 
that BigQueryIO only produces/consumes dictionaries, while the DataFrame API 
needs a schema (that's why we have to specify the schema manually in [this 
example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).
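
To make the gap concrete, the manual workaround today looks roughly like 
this. A minimal sketch: the table and column names are placeholders, not 
taken verbatim from the linked example.

{code:python}
import apache_beam as beam
from apache_beam.dataframe import convert

with beam.Pipeline() as p:
  rows = (
      p
      # ReadFromBigQuery yields plain dicts with no schema attached.
      | beam.io.ReadFromBigQuery(table='my-project:my_dataset.flights')
      # So we attach one by hand: each dict becomes a beam.Row with
      # explicitly typed fields, which Beam's type inference turns into
      # a schema.
      | beam.Map(lambda d: beam.Row(
          airline=str(d['airline']),
          departure_delay=float(d['departure_delay'])))
  )
  # Once the PCollection has a schema, it can enter the DataFrame API.
  df = convert.to_dataframe(rows)
  mean_delay = df.groupby('airline').departure_delay.mean()
{code}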

A better approach would be:
# Add support for producing/consuming schema-aware PCollections in BigQueryIO 
(this would mean looking up the schema in BQ, then adding logic to the 
pipeline so the resulting PCollections carry that schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the 
hood. These would still be nice to have so users familiar with DataFrames 
don't need to use classic Beam at all (see the sketch after this list).
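
To make item 2 concrete, here is a hypothetical sketch of what a read_gbq 
built on BigQueryIO could look like once item 1 is done. The name and 
signature are speculative, not an existing Beam API:

{code:python}
import apache_beam as beam
from apache_beam.dataframe import convert

def read_gbq(pipeline, table):
  # Today ReadFromBigQuery yields schemaless dicts, so to_dataframe()
  # would fail here. With item 1 in place, `rows` would already carry
  # the schema looked up from BQ, and no manual beam.Row step is needed.
  rows = pipeline | beam.io.ReadFromBigQuery(table=table)
  return convert.to_dataframe(rows)
{code}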


> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
>                 Key: BEAM-11587
>                 URL: https://issues.apache.org/jira/browse/BEAM-11587
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-dataframe, io-py-gcp, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for 
> querying BigQuery.
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
> and
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html


