[
https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=331576&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-331576
]
ASF GitHub Bot logged work on BEAM-8382:
----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Oct/19 18:40
Start Date: 21/Oct/19 18:40
Worklog Time Spent: 10m
Work Description: cmachgodaddy commented on issue #9765: [BEAM-8382] Add
polling interval to KinesisIO.Read
URL: https://github.com/apache/beam/pull/9765#issuecomment-544639586
@aromanenko-dev , I think we all understand it right, from what you
described above we just say it in different ways ;-) . In simple words, each
split, run on each worker, will create a number of threads base off num of
shards. That's means if we have 10 splits, we will have 10 guys/reader read one
shard (of course, it will read all shards)? Hope you agree with me this point?
Now, what Amazon recommend is we should have one client read one shard. Here is
what they say:
> Typically, when you use the KCL, you should ensure that the number of
instances does not exceed the number of shards (except for failure standby
purposes). Each shard is processed by exactly one KCL worker and has exactly
one corresponding record processor, so you never need multiple instances to
process one shard. However, one worker can process any number of shards, so
it's fine if the number of shards exceeds the number of instances.
And, here is why they do that,
https://docs.aws.amazon.com/streams/latest/dev/kinesis-low-latency.html
But, don't argue that here they use KCL, so pls consider one KCL is one
KinesisClient that our split use to connect to Kinesis and read a shard. And
don't misunderstand the `one worker can read a number of shard`. Here in their
context, or the example in their doc, their worker is an application that run
on EC2 instance, and when that scale this application, it will loadbalance the
KCL, e.g. two EC2 will have two KCL, each read two shards (assume a stream has
4 shards).
The point is we don't want to have number readers greater number of shards
in one application (or one pipeline in our context)? Imagine, if we have 10
pipelines deployed in our runners, and which has parallelism of 10, then we
will have 10 x 10 readers reading one shard ? And this 10 x 10 readers are not
loadbalanced since we are not using KCL? (even with KCL, this number still
exceed num of shard?). I am sure we will get "throughput exception" in the log
:-)
That's why in my subscribing POC (for the enhanced-fan-out), I design it in
a way that each split read only one shard (1:1).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 331576)
Time Spent: 4h (was: 3h 50m)
> Add polling interval to KinesisIO.Read
> --------------------------------------
>
> Key: BEAM-8382
> URL: https://issues.apache.org/jira/browse/BEAM-8382
> Project: Beam
> Issue Type: Improvement
> Components: io-java-kinesis
> Affects Versions: 2.13.0, 2.14.0, 2.15.0
> Reporter: Jonothan Farr
> Assignee: Jonothan Farr
> Priority: Major
> Time Spent: 4h
> Remaining Estimate: 0h
>
> With the current implementation we are observing Kinesis throttling due to
> ReadProvisionedThroughputExceeded on the order of hundreds of times per
> second, regardless of the actual Kinesis throughput. This is because the
> ShardReadersPool readLoop() method is polling getRecords() as fast as
> possible.
> From the KDS documentation:
> {quote}Each shard can support up to five read transactions per second.
> {quote}
> and
> {quote}For best results, sleep for at least 1 second (1,000 milliseconds)
> between calls to getRecords to avoid exceeding the limit on getRecords
> frequency.
> {quote}
> [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html]
> [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)