[jira] [Commented] (NIFI-11791) PutBigQuery processor lacks functionality found in PutBigQueryBatch

Marcio S. (Jira) Wed, 12 Jul 2023 09:48:10 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-11791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742489#comment-17742489
 ]


Marcio S. commented on NIFI-11791:
----------------------------------

[~pvillard], there are libraries (in Java and other languages) for several 
[BigQuery (Core) 
APIs|https://cloud.google.com/bigquery/docs/reference/libraries-overview], 
including:
 * [BigQuery 
API|https://cloud.google.com/java/docs/reference/google-cloud-bigquery/latest/overview],
 which _provides resources for creating, modifying, and deleting core resources 
such as datasets, tables, jobs, and routines._ I believe this is the one used 
by the now deprecated PutBigQueryBatch processor.
 * [BigQuery legacy streaming 
API|https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery], which 
I believe is used by the also deprecated PutBigQueryStreaming processor. Google 
recommends against using it for new projects.
 * [BigQuery Storage 
API|https://cloud.google.com/java/docs/reference/google-cloud-bigquerystorage/latest/overview],
 which _exposes high throughput data reading for consumers who need to scan 
large volumes of managed data from their own applications and tools._ It is in 
fact two APIs: the Storage Read API and the Storage Write API. _The BigQuery Storage 
Write API is a unified data-ingestion API for BigQuery. It combines streaming 
ingestion and batch loading into a single high-performance API. You can use the 
Storage Write API to stream records into BigQuery in real time or to batch 
process an arbitrarily large number of records and commit them in a single 
atomic operation._ I believe this is the one used by the newer PutBigQuery 
processor.

However, the BigQuery Storage API doesn't replace the BigQuery (Core) API. 
The former doesn't deal with jobs, for example. The latter does, and jobs are 
exactly what one needs for several BigQuery tasks, including creating 
snapshot tables.
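
To make the point about jobs concrete, here is a minimal sketch of the request body that a load job carries when submitted through the core API's REST method `jobs.insert`. It is not taken from NiFi's source. The project, dataset, table, and bucket names are hypothetical, and the sketch assumes the data is already staged in Cloud Storage, whereas PutBigQueryBatch uploads the flow file content itself. The two disposition fields are what give the snapshot behaviour:

```python
import json


def build_load_job_body(project_id, dataset_id, table_id, source_uris,
                        source_format="NEWLINE_DELIMITED_JSON"):
    """Build the JSON body for a BigQuery `jobs.insert` load job.

    The helper name and its arguments are illustrative only; the field
    names follow the BigQuery REST API's load job configuration.
    """
    return {
        "configuration": {
            "load": {
                "sourceUris": list(source_uris),
                "sourceFormat": source_format,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Create the destination table if it does not exist yet.
                "createDisposition": "CREATE_IF_NEEDED",
                # Replace the existing table data instead of appending.
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


# Hypothetical identifiers, used only for illustration.
body = build_load_job_body(
    "my-project", "my_dataset", "daily_snapshot",
    ["gs://my-bucket/export/*.json"],
)
print(json.dumps(body, indent=2))
```

The Storage Write API has no equivalent of this job configuration, which is why the truncate-and-reload pattern cannot be expressed with it directly.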

> PutBigQuery processor lacks functionality found in PutBigQueryBatch
> -------------------------------------------------------------------
>
>                 Key: NIFI-11791
>                 URL: https://issues.apache.org/jira/browse/NIFI-11791
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>    Affects Versions: 2.0.0, 1.22.0
>            Reporter: Marcio S.
>            Priority: Major
>
> Before PutBigQuery, we had PutBigQueryBatch and PutBigQueryStreaming, both now 
> deprecated. Not sure if PutBigQuery was designed to completely replace its 
> older brothers, but it cannot do that yet because of some missing features. 
> For example, we can't use PutBigQuery alone to create snapshot tables, 
> something that was easy to do with PutBigQueryBatch. 
> A snapshot table is a recent copy of a table from a database or a subset of 
> rows/columns of a table. It is used to dynamically replicate data between 
> distributed databases. Using PutBigQueryBatch, we can achieve that by setting 
> the following properties:
>  * Create Disposition = CREATE_IF_NEEDED
>  * Write Disposition = WRITE_TRUNCATE
> I understand that PutBigQuery uses the newer [BigQuery Storage Write 
> API|https://cloud.google.com/bigquery/docs/write-api], so adding the missing 
> functionality might not be possible. 
> But please note the older BigQuery (core) API (the one I believe 
> PutBigQueryBatch uses) allows the user to submit jobs to load data into 
> BigQuery in a very convenient way. That is something I'd like to see 
> preserved in future versions of NiFi.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
