[
https://issues.apache.org/jira/browse/FLINK-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123002#comment-16123002
]
ASF GitHub Bot commented on FLINK-7407:
---------------------------------------
GitHub user tzulitai opened a pull request:
https://github.com/apache/flink/pull/4526
[FLINK-7407] [kafka] Adapt AbstractPartitionDiscoverer to handle
non-contiguous partition metadata
## What is the purpose of the change
Previously, the `AbstractPartitionDiscoverer` tracked discovered partitions
by keeping only the largest discovered partition id per topic. Any fetched
partition metadata with an id smaller than this was treated as already
discovered. This assumption of contiguous partition ids is too naive: a
partition that was temporarily unavailable during an earlier discovery can be
permanently shadowed by discovered partitions with larger ids, and is then
never assigned.
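The shadowing problem can be reproduced with a small, simplified model. The sketch below is not Flink's actual code: the class `LargestIdTracker` and its `discover` method are hypothetical names, and partitions are reduced to plain integer ids. It only illustrates how tracking the largest id per topic loses a partition that shows up late.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of largest-id tracking (not the Flink class).
class LargestIdTracker {
    private final Map<String, Integer> topicToLargestDiscoveredId = new HashMap<>();

    /** Returns the ids treated as newly discovered; expects ids sorted ascending. */
    List<Integer> discover(String topic, List<Integer> sortedFetchedIds) {
        List<Integer> newIds = new ArrayList<>();
        for (int id : sortedFetchedIds) {
            Integer largest = topicToLargestDiscoveredId.get(topic);
            // Only ids above the largest seen id count as new.
            if (largest == null || id > largest) {
                topicToLargestDiscoveredId.put(topic, id);
                newIds.add(id);
            }
        }
        return newIds;
    }

    public static void main(String[] args) {
        LargestIdTracker tracker = new LargestIdTracker();
        // Partition 1 is temporarily unavailable during the first fetch.
        System.out.println(tracker.discover("t", Arrays.asList(0, 2)));    // prints [0, 2]
        // Partition 1 is back, but 1 < 2, so it is never reported as new.
        System.out.println(tracker.discover("t", Arrays.asList(0, 1, 2))); // prints []
    }
}
```

In this model, partition 1 stays unassigned for as long as the tracker lives, which is the corner case described above.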
## Brief change log
- Replace `Map<String, Integer> topicToLargestDiscoveredId` with a simple
`Set<KafkaTopicPartition>` to track already discovered partitions.
- Minor `hotfix` to remove unused method in `AbstractPartitionDiscoverer`.
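
As a rough illustration of the set-based tracking named in the change log, here is a minimal sketch. It is not the code from this pull request: `SeenPartitionTracker`, its nested `TopicPartition` class (a stand-in for `KafkaTopicPartition`), and the `discover` method are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of set-based tracking (not the Flink implementation).
class SeenPartitionTracker {

    // Stand-in for KafkaTopicPartition; equals/hashCode are needed for set membership.
    static final class TopicPartition {
        final String topic;
        final int partition;

        TopicPartition(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TopicPartition)) {
                return false;
            }
            TopicPartition other = (TopicPartition) o;
            return partition == other.partition && topic.equals(other.topic);
        }

        @Override
        public int hashCode() {
            return Objects.hash(topic, partition);
        }

        @Override
        public String toString() {
            return topic + "-" + partition;
        }
    }

    private final Set<TopicPartition> discoveredPartitions = new HashSet<>();

    /** Returns partitions not seen before; the input order does not matter. */
    List<TopicPartition> discover(List<TopicPartition> fetched) {
        List<TopicPartition> newPartitions = new ArrayList<>();
        for (TopicPartition tp : fetched) {
            // Set.add returns true only if the element was not already present.
            if (discoveredPartitions.add(tp)) {
                newPartitions.add(tp);
            }
        }
        return newPartitions;
    }

    public static void main(String[] args) {
        SeenPartitionTracker tracker = new SeenPartitionTracker();
        // Partition 1 is missing from the first fetch.
        System.out.println(tracker.discover(Arrays.asList(
                new TopicPartition("t", 0), new TopicPartition("t", 2)))); // prints [t-0, t-2]
        // Partition 1 appears later, in unsorted metadata, and is still picked up.
        System.out.println(tracker.discover(Arrays.asList(
                new TopicPartition("t", 2), new TopicPartition("t", 1),
                new TopicPartition("t", 0)))); // prints [t-1]
    }
}
```

Because membership is checked per partition rather than against a per-topic maximum, the sketch makes no assumption about id contiguity or about the order of the fetched metadata.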
## Verifying this change
This change is verified by a new
`AbstractPartitionDiscovererTest.testNonContiguousPartitionIdDiscovery` test.
The test covers the case where fetched partition metadata is non-contiguous,
i.e. some smaller partition ids are missing from it.
Other aspects should be covered by existing tests in
`AbstractPartitionDiscovererTest`.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: **no**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **no**
- Anything that affects deployment or recovery: **no**
## Documentation
- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? **not applicable**
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tzulitai/flink FLINK-7407
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4526.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4526
----
commit b29bcb4a19268a1ad212b85cb7de633f23c0a3c6
Author: Tzu-Li (Gordon) Tai <[email protected]>
Date: 2017-08-11T07:53:46Z
[FLINK-7407] [kafka] Adapt AbstractPartitionDiscoverer to handle
non-contiguous partition metadata
Previously, the AbstractPartitionDiscoverer tracked discovered
partitions by keeping only the largest discovered partition id. All
fetched partition metadata with ids smaller than this id would be
considered as discovered. This assumption of contiguous partition ids is
too naive for corner cases where there may be undiscovered partitions
that were temporarily unavailable before and were shadowed by
discovered partitions with larger partition ids.
This commit switches to a set for tracking seen partitions, which also
removes the need to pre-sort fetched partitions.
commit 2dd84c8ba468fcf8552ce8f6a6f4d5c4eb7e4a10
Author: Tzu-Li (Gordon) Tai <[email protected]>
Date: 2017-08-11T07:57:52Z
[hotfix] [kafka] Remove unused shouldAssignToThisSubtask method in
AbstractPartitionDiscoverer
----
> Assumption of partition id strict contiguity is too naive in Kafka consumer's
> AbstractPartitionDiscoverer
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLINK-7407
> URL: https://issues.apache.org/jira/browse/FLINK-7407
> Project: Flink
> Issue Type: Improvement
> Components: Kafka Connector
> Affects Versions: 1.4.0
> Reporter: Tzu-Li (Gordon) Tai
> Assignee: Tzu-Li (Gordon) Tai
> Priority: Blocker
> Fix For: 1.4.0
>
>
> In the Kafka Consumer's {{AbstractPartitionDiscoverer}}, for partition
> discovery, already discovered partitions are tracked with the following map:
> {code}
> Map<String, Integer> topicsToLargestDiscoveredPartitionId
> {code}
> Simply put, on each discovery attempt's metadata fetch, all partition ids of
> a given topic that are smaller than the largest seen id will be ignored and
> not assigned. This approach relies on the assumption that fetched partition
> ids of a single topic are always strictly contiguous starting from 0.
> This assumption may be too naive, in that partitions which were temporarily
> unavailable at the time of a discovery would be shadowed by available
> partitions with larger ids, and from then on would be left unassigned.
> We should redesign how the {{AbstractPartitionDiscoverer}} tracks discovered
> partitions by not relying on the contiguity assumption, and also add test
> cases for non-contiguous fetched partition ids.