Jackie-Jiang commented on a change in pull request #5147: Support default star-tree
URL: https://github.com/apache/incubator-pinot/pull/5147#discussion_r399879683
 
 

 ##########
 File path: pinot-core/src/main/java/org/apache/pinot/core/startree/v2/builder/StarTreeV2BuilderConfig.java
 ##########
 @@ -58,6 +69,73 @@ public static StarTreeV2BuilderConfig fromIndexConfig(StarTreeIndexConfig indexC
     return builder.build();
   }
 
+  /**
+   * Generates default config based on the segment metadata.
+   * <ul>
+   *   <li>
+   *     All dictionary-encoded single-value dimensions (including date-time columns) with cardinality smaller or equal
+   *     to the threshold will be included in the split order, sorted by their cardinality in descending order
+   *   </li>
+   *   <li>Time column (if exists and dictionary-encoded) will be appended to the split order as the last element</li>
+   *   <li>Use COUNT(*) and SUM for all numeric metrics as function column pairs</li>
+   *   <li>Use default value for max leaf records</li>
+   * </ul>
+   */
+  public static StarTreeV2BuilderConfig generateDefaultConfig(SegmentMetadataImpl segmentMetadata) {
+    Schema schema = segmentMetadata.getSchema();
+    List<ColumnMetadata> dimensionColumnMetadataList = new ArrayList<>();
+    String timeColumn = null;
+    List<String> numericMetrics = new ArrayList<>();
+
+    for (FieldSpec fieldSpec : schema.getAllFieldSpecs()) {
+      if (!fieldSpec.isSingleValueField() || fieldSpec.isVirtualColumn()) {
+        continue;
+      }
+      String column = fieldSpec.getName();
+      switch (fieldSpec.getFieldType()) {
+        case DIMENSION:
+        case DATE_TIME:
+          ColumnMetadata columnMetadata = segmentMetadata.getColumnMetadataFor(column);
 
 Review comment:
   Here I assume the time column will appear in most queries, either in a range
   filter or in the group-by, so I decided to always include it as the last
   dimension in the split order.
   For `DATE_TIME` columns, I assume the query pattern is similar to that of
   other dimensions, so they follow the same rule as dimensions.
   Updated the comments to reflect this.

   An alternative is to treat all of these columns the same way, but IMO always
   putting the time column last should suit a wider range of use cases.
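
   To make the rule above concrete, here is a minimal standalone sketch of the
   split-order selection as described in the Javadoc and this comment. It is
   not the code from the PR: the class `DefaultSplitOrderSketch`, the method
   `defaultSplitOrder`, the example column names, and the threshold value are
   all hypothetical, and the dictionary-encoding checks are left out. Ties in
   cardinality are broken by column name only to keep the output deterministic,
   which is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Pinot.
public class DefaultSplitOrderSketch {

  /**
   * Builds a split order from (column -> cardinality) pairs:
   * keep columns whose cardinality is <= threshold, sort them by cardinality
   * in descending order, then append the time column (if any) as the last element.
   */
  public static List<String> defaultSplitOrder(Map<String, Integer> dimensionCardinalities, String timeColumn,
      int cardinalityThreshold) {
    List<Map.Entry<String, Integer>> candidates = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : dimensionCardinalities.entrySet()) {
      // Skip the time column here so it is never added twice.
      if (!entry.getKey().equals(timeColumn) && entry.getValue() <= cardinalityThreshold) {
        candidates.add(entry);
      }
    }

    // Highest cardinality first; the name tie-break is an assumption for determinism.
    candidates.sort((a, b) -> {
      int byCardinality = Integer.compare(b.getValue(), a.getValue());
      return byCardinality != 0 ? byCardinality : a.getKey().compareTo(b.getKey());
    });

    List<String> splitOrder = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : candidates) {
      splitOrder.add(entry.getKey());
    }
    // The time column always goes last, regardless of its cardinality.
    if (timeColumn != null) {
      splitOrder.add(timeColumn);
    }
    return splitOrder;
  }

  public static void main(String[] args) {
    // Made-up columns and cardinalities, purely for illustration.
    Map<String, Integer> cardinalities = new LinkedHashMap<>();
    cardinalities.put("country", 200);
    cardinalities.put("browser", 20);
    cardinalities.put("userId", 1_000_000); // above the threshold, so excluded
    cardinalities.put("eventDate", 365);    // date-time column, treated like a dimension

    int threshold = 10_000; // hypothetical threshold value
    System.out.println(defaultSplitOrder(cardinalities, "daysSinceEpoch", threshold));
  }
}
```

   With this example input, `eventDate`, `country`, and `browser` are ordered
   by descending cardinality, `userId` is dropped for exceeding the threshold,
   and `daysSinceEpoch` is appended at the end.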
