chenboat commented on PR #18433: URL: https://github.com/apache/pinot/pull/18433#issuecomment-4580920166
> > I do not understand this example and the argument. Partitions 0 and 10000 belong to two different streams. Although both of them are partition 0 of their respective streams, why do they need to be colocated?
>
> Good question - let me clarify the colocation argument.
>
> In this specific setup, both streams are co-partitioned by the same key (trace_id). That means stream 0 partition 0 and stream 1 partition 0 contain data for the same set of trace IDs (those where trace_id % 3 == 0). Colocating them on the same server means a query filtering by a specific trace_id can be served entirely locally without scatter-gathering across multiple server groups.
>
> That said, you're right that if the two streams have no relationship between their partition keys, colocation across streams wouldn't be a requirement.
>
> But more fundamentally, the fix is necessary for correctness of instance assignment even independent of the colocation argument. With numPartitions: 3 configured, the intent is:
>
> * stream partition 0 → instance group 0
> * stream partition 1 → instance group 1
> * stream partition 2 → instance group 2
>
> Without the fix, stream 1's segments get assigned via the raw Pinot partition ID:
>
> * 10000 % 3 = 1 → instance group 1 (wrong, should be 0)
> * 10001 % 3 = 2 → instance group 2 (wrong, should be 1)
> * 10002 % 3 = 0 → instance group 0 (wrong, should be 2)

In the above example, the current assignment places segments in a round-robin manner to balance segment placement. Why is 10000 % 3 = 1 a wrong assignment result?

> This produces an arbitrary and scrambled mapping that doesn't match what the user configured at all: segments from stream 1 would be distributed across servers in a way that's inconsistent with the replicaGroupPartitionConfig. The fix ensures both streams use their stream-level partition ID consistently when computing the instance group.
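To make the arithmetic under discussion concrete, here is a minimal sketch of the two mappings. It assumes that stream i's partition p is encoded as Pinot partition ID i * 10000 + p, which is inferred from the example above rather than taken from Pinot's code; the constant and function names are hypothetical.

```python
# Assumption inferred from the example: stream i, partition p is encoded as
# the Pinot partition ID i * 10000 + p. Names here are illustrative only.
PARTITION_PADDING = 10_000
NUM_PARTITIONS = 3  # numPartitions from the example configuration


def group_from_raw_id(pinot_partition_id: int) -> int:
    """Instance group computed from the raw Pinot partition ID."""
    return pinot_partition_id % NUM_PARTITIONS


def group_from_stream_partition(pinot_partition_id: int) -> int:
    """Instance group computed from the stream-level partition ID."""
    stream_partition_id = pinot_partition_id % PARTITION_PADDING
    return stream_partition_id % NUM_PARTITIONS


def compare(pinot_partition_ids):
    """Return (partition ID, raw-ID group, stream-level group) per ID."""
    return [
        (pid, group_from_raw_id(pid), group_from_stream_partition(pid))
        for pid in pinot_partition_ids
    ]


if __name__ == "__main__":
    for pid, raw, stream in compare([0, 1, 2, 10000, 10001, 10002]):
        print(f"partition {pid}: raw -> group {raw}, stream-level -> group {stream}")
```

Under these assumptions both mappings place one partition of each stream on every instance group, so the load is balanced either way. They differ only in which group stream 1's partitions land on, and therefore in whether partition p of both streams ends up on the same group.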
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
