waitingF commented on code in PR #8376:
URL: https://github.com/apache/hudi/pull/8376#discussion_r1166212077
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java:
##########
@@ -148,9 +166,58 @@ public static OffsetRange[]
computeOffsetRanges(Map<TopicPartition, Long> fromOf
}
Review Comment:
@bvaradar I think the algorithm would not work well when the data is skewed.
With skewed data, it does not split a partition evenly. For example, given
topic partitions "0:0->100, 1:0->500" and minPartitions=3, the algorithm will
generate 3 ranges: "0:0->100, 1:0->200, 1:200->300". The 2 ranges of
partition 1 are not split evenly (200 vs. 100 offsets). With more skewed
partitions, it gets worse.
With skewed data, resplit generates even ranges within each TopicPartition,
because it first allocates ranges to the topic partitions and then resplits
those allocated ranges into roughly minPartitions ranges.
Given this, and since the cost of the resplit should be very small, I think
resplit is the better option.
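
To make the example concrete, here is a rough sketch of the resplit idea. It is
not the code in this PR: `ResplitSketch`, `Range`, and `resplit` are
hypothetical names, and the proportional rounding rule is an assumption. It
assumes the first pass has already allocated "0:0->100, 1:0->300".

```java
import java.util.ArrayList;
import java.util.List;

public class ResplitSketch {
  // Hypothetical stand-in for an offset range: [from, until) on one partition.
  static final class Range {
    final int partition;
    final long from;
    final long until;

    Range(int partition, long from, long until) {
      this.partition = partition;
      this.from = from;
      this.until = until;
    }

    long size() {
      return until - from;
    }

    @Override
    public String toString() {
      return partition + ":" + from + "->" + until;
    }
  }

  // Split each already-allocated range into equal pieces. The number of pieces
  // is proportional to the range's share of all offsets (assumed rule), so the
  // total number of ranges is roughly minPartitions.
  static List<Range> resplit(List<Range> allocated, int minPartitions) {
    long total = 0;
    for (Range r : allocated) {
      total += r.size();
    }
    List<Range> out = new ArrayList<>();
    for (Range r : allocated) {
      if (r.size() == 0) {
        out.add(r); // nothing to split
        continue;
      }
      long parts = Math.max(1L, Math.round((double) r.size() / total * minPartitions));
      parts = Math.min(parts, r.size()); // never more pieces than offsets
      long start = r.from;
      for (long i = 1; i <= parts; i++) {
        long end = r.from + r.size() * i / parts;
        out.add(new Range(r.partition, start, end));
        start = end;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Range> allocated = new ArrayList<>();
    allocated.add(new Range(0, 0, 100));
    allocated.add(new Range(1, 0, 300));
    System.out.println(resplit(allocated, 3));
  }
}
```

With minPartitions=3 this prints "[0:0->100, 1:0->150, 1:150->300]", so the two
ranges of partition 1 are the same size.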
What do you think?