eye-gu commented on issue #13811:
URL: https://github.com/apache/skywalking/issues/13811#issuecomment-4231841193

   > > 1. Shard tag and entity tag can differ, scattering the same entity 
across nodes.
   > > 
   > > Entity and ShardingKey are independent fields in Measure. When 
ShardingKey is set, it overrides the shard routing from Entity. For example, 
with entity.tag_names=["service_id"] and 
sharding_key.tag_names=["instance_id"], data for the same service_id lands on 
different shards/nodes under different instance_id values. Each node only sees 
a partial view of that entity.
   > 
   > No, ShardingKey and Entity are not independent. ShardingKey aims to 
enhance topn streaming performance and must adhere to the rule that the same 
entity always maps to the same node. Refer to the example I mentioned at 
[#12526](https://github.com/apache/skywalking/issues/12526). The OAP server 
follows the rule to set up the ShardingKey.
   > 
   > Your insight inspired me to add a validation step to enforce this implicit 
rule. If the end user sets them as your example, it will cause an unexpected 
result.
   > 
   > > 2. Even on a single node, agg=UNSPECIFIED still truncates incorrectly.
   > > 
   > > The coordinator sends agg=AGGREGATION_FUNCTION_UNSPECIFIED to data 
nodes, which prevents proper aggregation. For a COUNT TopN with TopN=2, a node 
holding entity-A(5 points), entity-B(3 points), entity-C(1 point) cannot 
compute COUNT(entity-A)=5. It simply truncates raw results by the TopN limit, 
returning incorrect partial data.
   > 
   > ## TopN Query Distribution and Sharding Logic
   > In BanyanDB, the current TopN query implementation pushes the aggregation 
functions directly to the data nodes rather than pruning them.
   > 
   > ### 1. Ad-hoc TopN Queries
   > During distributed analysis, the system determines whether to push down 
the logic based on the presence of an aggregate function:
   > 
   > ```go
   > // DistributedAnalyze converts logical expressions into an executable 
   > // operation tree represented by a Plan.
   > func DistributedAnalyze(criteria *measurev1.QueryRequest, ss []logical.Schema) (logical.Plan, error) {
   >     // ...
   >     pushDownAgg := criteria.GetAgg() != nil
   >     plan := newUnresolvedDistributed(criteria, pushDownAgg)
   >     // ...
   > }
   > ```
   > 
   > If `criteria.GetAgg()` is not nil, the aggregation function is pushed down 
to the data nodes for execution.
   > 
   > ### 2. Pre-calculated TopN Streaming
   > If you are referring to pre-calculated TopN streaming rather than ad-hoc 
queries, the behavior relies on the `ShardingKey`. To maintain high 
performance, BanyanDB ensures that all data for a specific entity resides on 
the same node.
   > 
   > #### Comparison: Sharding Scenarios
   > Suppose we want to calculate Top 2 by Count for the entity set `Service + 
Instance`.
   > 
   > | Scenario | Configuration | Data Distribution & Merging |
   > | --- | --- | --- |
   > | A | No ShardingKey | Node A returns ServiceA(Inst1:5, Inst3:3). Node B returns ServiceA(Inst2:6, Inst4:1). The Liaison node must merge these results to output: ServiceA(Inst2:6, Inst1:5). |
   > | B | ShardingKey = Service | Node A contains all data for ServiceA and returns ServiceA(Inst2:6, Inst1:5) directly. Node B contains no data for ServiceA. |
   > 
   > ### Design Principle
   > A core design principle of BanyanDB is to avoid distributing the same 
aggregation entity across different data nodes. By ensuring an entity's data is 
localized to a single node via the `ShardingKey`, we eliminate unnecessary 
network overhead and coordinator-side merging, significantly improving 
performance.
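
To make the two scenarios in the quoted comparison concrete, here is a minimal Go sketch of the merge step. It is not BanyanDB code: `InstanceCount` and `mergeTopN` are hypothetical names, and it assumes each instance's count is complete on whichever node reports it, as in the example figures above.

```go
package main

import (
	"fmt"
	"sort"
)

// InstanceCount is a hypothetical (instance, count) pair for one service.
type InstanceCount struct {
	Instance string
	Count    int
}

// mergeTopN combines partial per-node results and keeps the n largest counts.
// Ties are broken by instance name so the output is deterministic.
func mergeTopN(n int, partials ...[]InstanceCount) []InstanceCount {
	var all []InstanceCount
	for _, p := range partials {
		all = append(all, p...)
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Count != all[j].Count {
			return all[i].Count > all[j].Count
		}
		return all[i].Instance < all[j].Instance
	})
	if len(all) > n {
		all = all[:n]
	}
	return all
}

func main() {
	// Scenario A: ServiceA's instances are spread over two nodes,
	// so a coordinator has to merge the partial lists.
	nodeA := []InstanceCount{{"Inst1", 5}, {"Inst3", 3}}
	nodeB := []InstanceCount{{"Inst2", 6}, {"Inst4", 1}}
	fmt.Println(mergeTopN(2, nodeA, nodeB)) // [{Inst2 6} {Inst1 5}]

	// Scenario B: one node holds every instance of ServiceA, so its
	// local top 2 is already the final answer.
	single := []InstanceCount{{"Inst1", 5}, {"Inst3", 3}, {"Inst2", 6}, {"Inst4", 1}}
	fmt.Println(mergeTopN(2, single)) // [{Inst2 6} {Inst1 5}]
}
```

Both calls give the same answer; the difference is only where the work happens. In scenario B the cross-node merge disappears, which is the saving the design principle describes.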
   
   1. Sorry for the noise! I wasn’t aware of this shard key limitation for 
measures, so this issue indeed doesn’t exist.
   2. For testing the distributed TopN query, my entry point was 
banyand/dquery/topn.go. However, at this entry point, there was no pushdown.
   
   
https://github.com/apache/skywalking-banyandb/blob/8105dfe1bd9787d8acdb5e9d9b780d85eb4db9a7/banyand/dquery/topn.go#L106-L108
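
For reference, the behaviour described in point 2 can be illustrated with a small standalone Go sketch. This is only an illustration of the difference between applying the limit to raw points and aggregating before ranking; `Point`, `EntityCount`, `truncateRaw`, and `countTopN` are hypothetical names and do not come from the BanyanDB code base.

```go
package main

import (
	"fmt"
	"sort"
)

// Point is a hypothetical raw data point tagged with its entity.
type Point struct {
	Entity string
}

// EntityCount pairs an entity with its aggregated COUNT.
type EntityCount struct {
	Entity string
	Count  int
}

// truncateRaw applies only the limit: it keeps the first n raw points
// and never computes a per-entity count.
func truncateRaw(points []Point, n int) []Point {
	if len(points) > n {
		return points[:n]
	}
	return points
}

// countTopN aggregates COUNT per entity first, then ranks and keeps n.
func countTopN(points []Point, n int) []EntityCount {
	counts := map[string]int{}
	for _, p := range points {
		counts[p.Entity]++
	}
	out := make([]EntityCount, 0, len(counts))
	for e, c := range counts {
		out = append(out, EntityCount{e, c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Entity < out[j].Entity
	})
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	// entity-A has 5 points, entity-B has 3, entity-C has 1.
	var points []Point
	for _, ec := range []EntityCount{{"entity-A", 5}, {"entity-B", 3}, {"entity-C", 1}} {
		for i := 0; i < ec.Count; i++ {
			points = append(points, Point{ec.Entity})
		}
	}
	fmt.Println(truncateRaw(points, 2)) // [{entity-A} {entity-A}]
	fmt.Println(countTopN(points, 2))   // [{entity-A 5} {entity-B 3}]
}
```

With TopN=2, the truncated output carries no usable count for any entity, while the aggregated path returns entity-A with 5 and entity-B with 3.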

