[
https://issues.apache.org/jira/browse/NIFI-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17942204#comment-17942204
]
Pierre Villard commented on NIFI-14453:
---------------------------------------
Hi [~svadapalli] - I looked into it and there is no easy solution that would
not require a significant refactor of the processor. Here is a summary:
Because we want to support exactly-once semantics, the processor creates its
own streams instead of using the default stream (which only provides
at-least-once semantics). The problem is that we currently create one stream
per flowfile, which does not scale if you are processing many flowfiles per
second: you quickly hit the BigQuery quotas. One option would be to keep a
stream open and reuse it across many flowfiles, but the complexity comes from
the fact that we support expression language for the dataset and table name, so
every flowfile can potentially go to a different destination and the processor
would have to properly manage many streams at once.
For anyone willing to spend time on this, I see two options:
* provide another transfer type option, "Stream with at-least-once", and simply
use the default stream with its at-least-once semantics. No need to create many
streams.
* implement proper stream management in the processor, with a map of table name
to stream, reusing streams across flowfiles and tracking the offsets correctly.
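A minimal sketch of the second option, assuming hypothetical {{WriteStream}} and opener stand-ins (this is not the actual Storage Write API client, just the per-table caching and offset-tracking shape the processor would need):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of option 2: cache one write stream per destination table and reuse
// it across flowfiles. WriteStream and the opener function are hypothetical
// stand-ins for the real BigQuery Storage Write API client calls.
class StreamCache {
    interface WriteStream {
        void append(byte[] rows, long offset); // offset tracking enables exactly-once
        void finalizeStream();
    }

    private final Map<String, WriteStream> streams = new ConcurrentHashMap<>();
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();
    private final Function<String, WriteStream> opener;

    StreamCache(Function<String, WriteStream> opener) {
        this.opener = opener;
    }

    // One stream per "dataset.table" key: created lazily, then reused, instead
    // of opening a fresh stream for every flowfile.
    void append(String tableKey, byte[] rows, int rowCount) {
        WriteStream stream = streams.computeIfAbsent(tableKey, opener);
        long offset = offsets.getOrDefault(tableKey, 0L);
        stream.append(rows, offset);
        offsets.put(tableKey, offset + rowCount);
    }

    // Finalize every cached stream, e.g. when the processor is stopped.
    void closeAll() {
        streams.values().forEach(WriteStream::finalizeStream);
        streams.clear();
        offsets.clear();
    }
}
```

With this shape, two flowfiles targeting the same evaluated dataset/table share a single stream, and only a change of destination opens a new one.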
The immediate mitigation is to use a MergeRecord processor before the
PutBigQuery processor to merge many flowfiles together into larger flowfiles
with many records. MergeRecord would need to be configured so that PutBigQuery
receives fewer than 10,000 flowfiles per hour, which corresponds to one of the
mentioned quotas ("10,000 streams every hour, per project per region").
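Back-of-envelope sizing for that MergeRecord configuration, assuming the current worst case of one stream per flowfile against the "10,000 streams every hour" quota (the rates and bundle sizes are illustrative, not official guidance):

```java
// With one stream created per flowfile, the per-hour stream quota caps the
// flowfile rate PutBigQuery can sustain; MergeRecord must bundle enough
// flowfiles together to stay under that cap.
class StreamQuota {
    static final long STREAMS_PER_HOUR_QUOTA = 10_000;

    // Maximum sustainable flowfile rate with one stream per flowfile.
    static double maxFlowFilesPerSecond() {
        return STREAMS_PER_HOUR_QUOTA / 3600.0; // about 2.78 flowfiles/s
    }

    // Minimum number of flowfiles MergeRecord must merge into each bundle so
    // the downstream stream-creation rate stays under the quota
    // (integer ceiling of upstreamRate * 3600 / quota).
    static long minBundleSize(long upstreamFlowFilesPerSecond) {
        long streamsPerHour = upstreamFlowFilesPerSecond * 3600;
        return (streamsPerHour + STREAMS_PER_HOUR_QUOTA - 1) / STREAMS_PER_HOUR_QUOTA;
    }
}
```

For example, at 100 flowfiles per second upstream, MergeRecord would need to merge at least 36 flowfiles into each outgoing flowfile to stay under the quota.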
There is still a change that I'm going to submit via a pull request: forcing
the finalize on the streams. This is, in theory, optional in streaming mode
because streams are closed based on a TTL, but right now a stream would only
close itself if no data is sent through it for 3 days... so I think this is
definitely not great and it could be the reason for hitting another quota...
So, while not perfect, the PR may help in your situation. I'd still recommend
using a MergeRecord processor to limit the number of flowfiles being processed
by PutBigQuery.
> PutBigQuery Creates too many active streams
> -------------------------------------------
>
> Key: NIFI-14453
> URL: https://issues.apache.org/jira/browse/NIFI-14453
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 2.3.0
> Reporter: Satya Vadapalli
> Assignee: Pierre Villard
> Priority: Major
>
> *The Issue:*
> We recently migrated from NiFi 1.x to 2.x and have been having an issue with
> the PutBigQuery processor. It used to work well with the older processor
> (PutBigQueryStreaming), I believe because it used the BigQuery REST API,
> whereas the new one uses the Storage Write API. I'm running into an issue
> where the processor opens over 10k streams instead of reusing an existing
> stream. Here's the error message:
>
> {code:java}
> PutBigQuery[id=01bc2d0f-0196-1000-0000-0000541e88b5] Processing halted:
> yielding [1 sec]: com.google.api.gax.rpc.FailedPreconditionException:
> io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Table has too many
> active streams with too little traffic. Please send more traffic through
> existing streams or finalize unused streams.
> Table=1087270590813:xxxxxx.xxxxxxxx. ActiveStreamCount=10139.
> ActualPerStreamBytesPerSec=0.0333333.
> RequiredPerStreamBytesPerSecForMoreStreams=300000. If you have already
> terminated all the traffic, the error will go away in two hours. To avoid
> this problem in the long term, please use less streams for the same amount of
> data Entity: projects/xxx/datasets/xxx/tables/xxx - Caused by:
> io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Table has too many
> active streams with too little traffic. Please send more traffic through
> existing streams or finalize unused streams.
> Table=1087270590813:xxxxxxx.xxxxxxxxxx. ActiveStreamCount=10139.
> ActualPerStreamBytesPerSec=0.0333333.
> RequiredPerStreamBytesPerSecForMoreStreams=300000. If you have already
> terminated all the traffic, the error will go away in two hours. To avoid
> this problem in the long term, please use less streams for the same amount of
> data Entity: projects/xxx/datasets/xxx/tables/xxx
> {code}
> *What is Expected:*
> PutBigQuery should be able to stream data into BigQuery with a single stream,
> using the Storage Write API as described in the documentation -
> [https://cloud.google.com/bigquery/docs/write-api-streaming]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)