[ 
https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818
 ] 

Brian Hulette commented on BEAM-11587:
--------------------------------------

Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually 
using the pandas read_gbq and to_gbq (in the same way that we use read_csv and 
to_csv) is more trouble than it's worth. The files read by read_csv (and the 
other read_* functions) are relatively easy to split up into partitions for 
distributed worker nodes to read. Splitting up a BigQuery read is more 
complicated, and we'd need to implement a bunch of logic for it.
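
To illustrate why file-based reads partition easily, here is a minimal 
stdlib-only sketch (not Beam's actual implementation). It cuts a CSV file into 
byte ranges that each start on a line boundary, so separate workers could read 
them independently. The helper names split_offsets and read_range are 
hypothetical, and the sketch assumes no header row and no quoted fields that 
contain newlines.

```python
import os
import tempfile


def split_offsets(path, num_splits):
    """Return contiguous (start, end) byte ranges that begin on line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, num_splits):
            # Jump to an evenly spaced offset, then advance past the next
            # newline so the range starts at the beginning of a line.
            f.seek(size * i // num_splits)
            f.readline()
            pos = f.tell()
            if starts[-1] < pos < size:
                starts.append(pos)
    return list(zip(starts, starts[1:] + [size]))


def read_range(path, start, end):
    """Read the complete lines contained in one byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("utf-8").splitlines()


def demo():
    """Write a small CSV, split it four ways, and read each part back."""
    rows = [f"{i},airline{i % 3},{i * 10}" for i in range(100)]
    with tempfile.NamedTemporaryFile(
            "w", suffix=".csv", delete=False, newline="") as tmp:
        tmp.write("\n".join(rows) + "\n")
    try:
        ranges = split_offsets(tmp.name, 4)
        parts = [read_range(tmp.name, s, e) for s, e in ranges]
    finally:
        os.remove(tmp.name)
    return rows, ranges, parts
```

Concatenating the parts in order reproduces the original rows. A BigQuery 
table or query result has no comparable byte offsets to seek to, which is why 
the equivalent splitting logic is more involved.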

The traditional BigQueryIO already has all of this splitting logic. The gap for 
BigQueryIO is that it only produces/consumes dictionaries, while the DataFrame 
API needs a PCollection with a schema (that's why we need to specify the schema 
manually in [this 
example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).

A better approach would be:
# Add support for producing/consuming a PCollection with a schema with 
BigQueryIO (this would mean looking up the schema in BQ, then adding logic to 
the pipeline to make PCollections with a schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the 
hood. These methods would still be nice to have so users familiar with 
DataFrames don't need to use classic Beam at all.
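
A rough, stdlib-only sketch of the core of step 1, under stated assumptions: 
given field descriptors shaped like a BigQuery table schema (name, type, mode), 
build a NamedTuple type and convert the dictionary rows into typed rows. Beam's 
Python SDK can infer a schema from a NamedTuple element type, so this is the 
kind of type the pipeline logic would need to attach. The type mapping is 
partial, and HYPOTHETICAL_SCHEMA, row_type_from_bq_schema and to_row are 
illustrative names, not existing Beam or BigQuery APIs. No actual BigQuery 
lookup is performed.

```python
import typing

# Partial, assumed mapping from BigQuery column types to Python types.
BQ_TO_PYTHON = {
    "STRING": str,
    "INTEGER": int,
    "INT64": int,
    "FLOAT": float,
    "FLOAT64": float,
    "BOOLEAN": bool,
    "BOOL": bool,
}

# Hypothetical table schema, standing in for one looked up in BigQuery.
HYPOTHETICAL_SCHEMA = [
    {"name": "date", "type": "STRING", "mode": "REQUIRED"},
    {"name": "airline", "type": "STRING", "mode": "REQUIRED"},
    {"name": "departure_delay", "type": "FLOAT", "mode": "NULLABLE"},
]


def row_type_from_bq_schema(name, fields):
    """Build a NamedTuple type from BigQuery-style field descriptors."""
    annotations = []
    for field in fields:
        py_type = BQ_TO_PYTHON[field["type"]]
        # Columns are treated as nullable unless marked otherwise.
        if field.get("mode", "NULLABLE") == "NULLABLE":
            py_type = typing.Optional[py_type]
        annotations.append((field["name"], py_type))
    return typing.NamedTuple(name, annotations)


def to_row(row_type, record):
    """Convert one dictionary row into a typed row; missing keys become None."""
    return row_type(**{f: record.get(f) for f in row_type._fields})
```

For example, to_row(row_type_from_bq_schema("FlightRow", HYPOTHETICAL_SCHEMA), 
{"date": "2012-12-23", "airline": "AA"}) yields a row whose departure_delay is 
None. Repeated and nested (RECORD) fields are deliberately left out of this 
sketch.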

> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
>                 Key: BEAM-11587
>                 URL: https://issues.apache.org/jira/browse/BEAM-11587
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-dataframe, io-py-gcp, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for 
> querying BigQuery.
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
> and
> [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
