[
https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818
]
Brian Hulette commented on BEAM-11587:
--------------------------------------
Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually
using the pandas read_gbq and to_gbq (in the same way that we use read_csv and
to_csv) is more trouble than it's worth. The reason is that the files read
by read_csv (and other read_* functions) are relatively easy to split up into
partitions for reading from distributed worker nodes. But splitting up a
BigQuery read is more complicated, and we'd need to implement a bunch of logic
for it.
The traditional BigQueryIO already has all of this splitting logic. The gap for
BigQueryIO is that it only produces/consumes dictionaries, while the DataFrame
API needs a PCollection with a schema (that's why we need to specify
the schema manually in [this
example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).
A better approach to this would be:
# Add support for producing/consuming a PCollection with a schema with
BigQueryIO (this would mean looking up the schema in BQ, then adding logic to
the pipeline to make PCollections with a schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the
hood. These methods would still be nice to have so users familiar with
DataFrames don't need to use classic Beam at all.
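To make step 1 a bit more concrete, here is a rough sketch of the "look up the schema, then build typed rows" part. It uses only the Python standard library and does not call BigQueryIO or the BigQuery client. The type mapping, the helper names (row_type_from_bq_schema, dict_to_row), and the sample field names are all hypothetical and only for illustration. The idea is that a NamedTuple type derived from the table schema could serve as the element type of a schema'd PCollection, since the Python SDK can infer a Beam schema from NamedTuple types.

```python
import typing

# Hypothetical, partial mapping from BigQuery column types to Python types.
# A real implementation would also need to cover RECORD, TIMESTAMP, NUMERIC, etc.
BQ_TO_PYTHON = {
    "STRING": str,
    "INTEGER": int,
    "INT64": int,
    "FLOAT": float,
    "FLOAT64": float,
    "BOOLEAN": bool,
    "BOOL": bool,
    "BYTES": bytes,
}


def row_type_from_bq_schema(fields, name="BigQueryRow"):
    """Build a NamedTuple type from BigQuery-style field descriptions.

    `fields` is a list of dicts with "name", "type" and optional "mode" keys,
    mirroring the shape of a BigQuery table schema.
    """
    typed_fields = []
    for field in fields:
        py_type = BQ_TO_PYTHON[field["type"].upper()]
        mode = field.get("mode", "NULLABLE").upper()
        if mode == "REPEATED":
            py_type = typing.Sequence[py_type]
        elif mode == "NULLABLE":
            py_type = typing.Optional[py_type]
        typed_fields.append((field["name"], py_type))
    return typing.NamedTuple(name, typed_fields)


def dict_to_row(row_type, record):
    """Convert one dictionary (as BigQueryIO produces) into a typed row.

    Keys missing from the dictionary become None.
    """
    return row_type(**{f: record.get(f) for f in row_type._fields})


# Illustrative schema; the field names are made up for this sketch.
schema = [
    {"name": "airline", "type": "STRING", "mode": "REQUIRED"},
    {"name": "departure_delay", "type": "FLOAT", "mode": "NULLABLE"},
]
FlightRow = row_type_from_bq_schema(schema, name="FlightRow")
row = dict_to_row(FlightRow, {"airline": "AA", "departure_delay": 12.5})
print(row)
```

In a pipeline, the conversion would presumably run in a Map step right after the BigQuery read, with the generated type declared as the output type so that the DataFrame API can pick up the schema. That wiring is not shown here.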
> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
> Key: BEAM-11587
> URL: https://issues.apache.org/jira/browse/BEAM-11587
> Project: Beam
> Issue Type: New Feature
> Components: dsl-dataframe, io-py-gcp, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Svetak Vihaan Sundhar
> Priority: P3
> Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for
> querying BigQuery.
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
>
> and
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html