somandal commented on code in PR #9117:
URL: https://github.com/apache/pinot/pull/9117#discussion_r932717509
##########
pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java:
##########
@@ -123,7 +123,12 @@ enum MetadataKey {
NUM_SEGMENTS_PRUNED_BY_LIMIT("numSegmentsPrunedByLimit",
MetadataValueType.INT),
NUM_SEGMENTS_PRUNED_BY_VALUE("numSegmentsPrunedByValue",
MetadataValueType.INT),
EXPLAIN_PLAN_NUM_EMPTY_FILTER_SEGMENTS("explainPlanNumEmptyFilterSegments",
MetadataValueType.INT),
-
EXPLAIN_PLAN_NUM_MATCH_ALL_FILTER_SEGMENTS("explainPlanNumMatchAllFilterSegments",
MetadataValueType.INT);
+
EXPLAIN_PLAN_NUM_MATCH_ALL_FILTER_SEGMENTS("explainPlanNumMatchAllFilterSegments",
MetadataValueType.INT),
+ // TODO: Add a mechanism to pass generic metadata without adding them to
the DataTable's metadata
+ // This metadata field is only required for emitting a metric to identify
whether queries using MV columns (as
+ // identifiers or via transform) are present in the Selection Order-By
queries. This was the only place where
+ // such a metadata could be added in the existing interfaces even though
we don't need this on the Broker side.
+ QUERY_HAS_MV_SELECTION_ORDER_BY("queryHasMVSelectionOrderBy",
MetadataValueType.STRING);
Review Comment:
hey @Jackie-Jiang, @siddharthteotia and I discussed the `ServerMetrics`
change and do think it is a good idea in general. One problem is that we need
an additional way to correlate a query with this metric, and for that purpose
we actually added a log in `QueryScheduler` in this existing PR to capture the
value of this flag. This is especially important for high QPS use cases where
we may need to comb through hundreds of logs on the Broker side to find the
possible offending queries. In addition to that, it may not be desirable to
emit a metric for every segment (since the operators are at a segment level)
and instead just emit once per query.
Thinking some more we have a couple of ideas to address this (have both a
log and a metric and avoid doing this for every segment belonging to the query):
Option 1 (preferred method):
- Modify the `QueryContext` to also contain the request ID from the
`ServerQueryRequest`. This will then be available throughout all the plan nodes
and operators. We can then use this request ID for logging purposes.
- To avoid emitting a metric and log in the `SelectionOrderByOperator` but
instead do this at the `InstanceResponseOperator` level where we will have the
combined `IntermediateResultsBlock` available. This will allow us to emit the
metric and log only once per query per server. The `SelectionOrderByOperator`
will still detect this scenario and set it in the `IntermediateResultsBlock` as
is being done in the current PR.
- For now we can avoid modifying all the operators and plan nodes to take
the `ServerMetrics` as input and instead only pass it down up to the
`InstanceResponsePlanNode` and `InstanceResponseOperator`. In the future it
should be trivial to extend all operators and plan nodes to have the
`ServerMetrics` if the need arises.
Option 2
- Add a transient metadata object to the `IntermediateResultsBlock` and
potentially `DataTable` where we can store all the transient metadata fields
which aren't serializable and will be ignored when creating the serialized
`DataTable`. The drawback with this approach is that certain operators like
Filter don't use `IntermediateResultsBlock` directly and we'll need to add
getters/setters to all the intermediate block types and copy these over. We
still only want to emit the metric and log just once per query.
Let us know your thoughts @Jackie-Jiang
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]