[
https://issues.apache.org/jira/browse/NIFI-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17942204#comment-17942204
]
Pierre Villard commented on NIFI-14453:
---------------------------------------
Hi [~svadapalli] - I looked into it and there is no easy solution that would
not require a significant refactor of the processor. Here is a summary:
Because we want to support exactly-once semantics, the processor creates its
own streams instead of using the default stream (which only provides
at-least-once semantics). The problem is that we currently create one stream
per flowfile, which does not scale if you are processing many flowfiles per
second: you quickly hit the BigQuery quotas. One option would be to keep a
stream open and reuse it across many flowfiles, but the complexity comes from
the fact that we support expression language for the dataset and table name, so
every flowfile can potentially go to a different destination and the processor
would have to properly manage many streams at once.
For anyone willing to spend time on this, I see two options:
* provide another transfer type option, "Stream with at-least-once", and simply
use the default stream with its at-least-once semantics. No need to create many
streams.
* implement proper stream management in the processor, with a map of table name
to stream, reusing streams across flowfiles and tracking the offsets correctly.
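A minimal sketch of the second option, assuming hypothetical {{WriteStream}} and opener stand-ins (this is not the actual Storage Write API client, just the per-table caching and offset-tracking shape the processor would need):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of option 2: cache one write stream per destination table and reuse
// it across flowfiles. WriteStream and the opener function are hypothetical
// stand-ins for the real BigQuery Storage Write API client calls.
class StreamCache {
    interface WriteStream {
        void append(byte[] rows, long offset); // offset tracking enables exactly-once
        void finalizeStream();
    }

    private final Map<String, WriteStream> streams = new ConcurrentHashMap<>();
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();
    private final Function<String, WriteStream> opener;

    StreamCache(Function<String, WriteStream> opener) {
        this.opener = opener;
    }

    // One stream per "dataset.table" key: created lazily, then reused, instead
    // of opening a fresh stream for every flowfile.
    void append(String tableKey, byte[] rows, int rowCount) {
        WriteStream stream = streams.computeIfAbsent(tableKey, opener);
        long offset = offsets.getOrDefault(tableKey, 0L);
        stream.append(rows, offset);
        offsets.put(tableKey, offset + rowCount);
    }

    // Finalize every cached stream, e.g. when the processor is stopped.
    void closeAll() {
        streams.values().forEach(WriteStream::finalizeStream);
        streams.clear();
        offsets.clear();
    }
}
```

With this shape, two flowfiles targeting the same evaluated dataset/table share a single stream, and only a change of destination opens a new one.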
The immediate mitigation is to use a MergeRecord processor before the
PutBigQuery processor to merge many flowfiles together into larger flowfiles
with many records. MergeRecord would need to be configured so that PutBigQuery
receives fewer than 10,000 flowfiles per hour, which corresponds to one of the
mentioned quotas ("10,000 streams every hour, per project per region").
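Back-of-envelope sizing for that MergeRecord configuration, assuming the current worst case of one stream per flowfile against the "10,000 streams every hour" quota (the rates and bundle sizes are illustrative, not official guidance):

```java
// With one stream created per flowfile, the per-hour stream quota caps the
// flowfile rate PutBigQuery can sustain; MergeRecord must bundle enough
// flowfiles together to stay under that cap.
class StreamQuota {
    static final long STREAMS_PER_HOUR_QUOTA = 10_000;

    // Maximum sustainable flowfile rate with one stream per flowfile.
    static double maxFlowFilesPerSecond() {
        return STREAMS_PER_HOUR_QUOTA / 3600.0; // about 2.78 flowfiles/s
    }

    // Minimum number of flowfiles MergeRecord must merge into each bundle so
    // the downstream stream-creation rate stays under the quota
    // (integer ceiling of upstreamRate * 3600 / quota).
    static long minBundleSize(long upstreamFlowFilesPerSecond) {
        long streamsPerHour = upstreamFlowFilesPerSecond * 3600;
        return (streamsPerHour + STREAMS_PER_HOUR_QUOTA - 1) / STREAMS_PER_HOUR_QUOTA;
    }
}
```

For example, at 100 flowfiles per second upstream, MergeRecord would need to merge at least 36 flowfiles into each outgoing flowfile to stay under the quota.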
There is still a change that I'm going to submit via a pull request: forcing
the finalize on the streams. This is, in theory, optional in streaming mode
because streams are closed based on a TTL, but right now a stream would only
close itself if no data is sent through it for 3 days... so I think this is
definitely not great and it could be the reason for hitting another quota...
So, while not perfect, the PR may help in your situation. I'd still recommend
using a MergeRecord processor to limit the number of flowfiles being processed
by PutBigQuery.
> PutBigQuery Creates too many active streams
> -------------------------------------------
>
> Key: NIFI-14453
> URL: https://issues.apache.org/jira/browse/NIFI-14453
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 2.3.0
> Reporter: Satya Vadapalli
> Assignee: Pierre Villard
> Priority: Major
>
> *The Issue:*
> We recently migrated from NiFi 1.x to 2.x and have been having an issue with
> the PutBigQuery processor. It used to work well with the older processor
> (PutBigQueryStreaming), I believe because it used the BigQuery REST API,
> whereas the new one uses the Storage Write API. I'm running into an issue
> where the processor opens over 10k streams instead of reusing an existing
> stream. Here's the error message:
>
> {code:java}
> PutBigQuery[id=01bc2d0f-0196-1000-0000-0000541e88b5] Processing halted:
> yielding [1 sec]: com.google.api.gax.rpc.FailedPreconditionException:
> io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Table has too many
> active streams with too little traffic. Please send more traffic through
> existing streams or finalize unused streams.
> Table=1087270590813:xxxxxx.xxxxxxxx. ActiveStreamCount=10139.
> ActualPerStreamBytesPerSec=0.0333333.
> RequiredPerStreamBytesPerSecForMoreStreams=300000. If you have already
> terminated all the traffic, the error will go away in two hours. To avoid
> this problem in the long term, please use less streams for the same amount of
> data Entity: projects/xxx/datasets/xxx/tables/xxx - Caused by:
> io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Table has too many
> active streams with too little traffic. Please send more traffic through
> existing streams or finalize unused streams.
> Table=1087270590813:xxxxxxx.xxxxxxxxxx. ActiveStreamCount=10139.
> ActualPerStreamBytesPerSec=0.0333333.
> RequiredPerStreamBytesPerSecForMoreStreams=300000. If you have already
> terminated all the traffic, the error will go away in two hours. To avoid
> this problem in the long term, please use less streams for the same amount of
> data Entity: projects/xxx/datasets/xxx/tables/xxx
> {code}
> *What is Expected:*
> PutBigQuery should be able to stream data into BigQuery with a single stream,
> using the Storage Write API as described in the documentation -
> [https://cloud.google.com/bigquery/docs/write-api-streaming]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)