[
https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=330576&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330576
]
ASF GitHub Bot logged work on BEAM-8382:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Oct/19 15:39
Start Date: 18/Oct/19 15:39
Worklog Time Spent: 10m
Work Description: aromanenko-dev commented on issue #9765: [BEAM-8382]
Add polling interval to KinesisIO.Read
URL: https://github.com/apache/beam/pull/9765#issuecomment-543801925
@jfarr Regarding backoff strategy - I think it's a good way in case if your
backend doesn't provide any statistics in advance at which rate you have to
perform next queries. We know only maximum limits and the result of last
`kinesis.getRecords()` execution. So, starting with recommended default value
and use proper backoff strategy in case of failures it could be optimal
solution when you need to limit rate of requests. Otherwise, there are big
chances that backend resources won't be utilised efficiently. `FluentBackoff`
is used in many Beam IOs for similar reasons.
I'd like to ask you questions about your use case. Let's suppose we can have
ability set static polling interval with `withPollingInterval(Duration)` (as
implemented in current PR). What would be default value in your case? 1 sec?
What would you do in case if it won't be enough for already running pipeline?
Would your need to change this value and restart pipeline manually?
Btw, default delay of 1 sec looks quite pessimistic since according to AWS
doc `each shard can support up to five read transactions per second.` So, in
case of one consumer per shard (I can guess this is a quite common case) it
should be 200 ms as default value.
PS: The goal of minimizing the number of tuning knobs is to reduce the
number of possible config combinations (which grows exponentially) in case if
it's possible to configure this automatically [1]. If it's not possible then
the better solution will be to provide flexible API, especially for unbounded
sources (as Kinesis, Kafka, etc), that are being used in long-running
pipelines.
[1] https://beam.apache.org/contribute/ptransform-style-guide/#configuration
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 330576)
Time Spent: 2h (was: 1h 50m)
> Add polling interval to KinesisIO.Read
> --------------------------------------
>
> Key: BEAM-8382
> URL: https://issues.apache.org/jira/browse/BEAM-8382
> Project: Beam
> Issue Type: Improvement
> Components: io-java-kinesis
> Affects Versions: 2.13.0, 2.14.0, 2.15.0
> Reporter: Jonothan Farr
> Assignee: Jonothan Farr
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> With the current implementation we are observing Kinesis throttling due to
> ReadProvisionedThroughputExceeded on the order of hundreds of times per
> second, regardless of the actual Kinesis throughput. This is because the
> ShardReadersPool readLoop() method is polling getRecords() as fast as
> possible.
> From the KDS documentation:
> {quote}Each shard can support up to five read transactions per second.
> {quote}
> and
> {quote}For best results, sleep for at least 1 second (1,000 milliseconds)
> between calls to getRecords to avoid exceeding the limit on getRecords
> frequency.
> {quote}
> [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html]
> [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)