[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8742: Null handling in SELECT, ORDER BY, DISTINCT, HAVING, GROUP BY, and few agg. functions and few data types

GitBox Fri, 20 May 2022 23:42:32 -0700


Jackie-Jiang commented on code in PR #8742:
URL: https://github.com/apache/pinot/pull/8742#discussion_r878653273



##########
pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java:
##########
@@ -76,6 +77,8 @@ byte[] toBytes()
 
   String[] getStringArray(int rowId, int colId);
 
+  MutableRoaringBitmap getColumnNullBitmap(int colId);

Review Comment:
   (minor) Suggest renaming to `getNullRowIds()`



##########
pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java:
##########
@@ -173,6 +188,25 @@ public byte[] toBytes()
       dataOutputStream.writeInt(bytes.length);
       dataOutputStream.write(bytes);
     }
+
+    // Write colum field specs.
+    // todo: consider serializing only defaultNullValues since this is the 
only required info.
+    int fieldSpecsSize = _columnFieldSpecs == null ? 0 : 
_columnFieldSpecs.length;
+    dataOutputStream.writeInt(fieldSpecsSize);

Review Comment:
   (MAJOR) This is backward incompatible change. The new broker won't 
understand the data schema returned from the old servers



##########
pinot-core/src/main/java/org/apache/pinot/core/query/reduce/DistinctDataTableReducer.java:
##########
@@ -99,11 +99,11 @@ private ResultTable reduceToResultTable(DistinctTable 
distinctTable) {
     Iterator<Record> iterator = distinctTable.getFinalResult();
     while (iterator.hasNext()) {
       Object[] values = iterator.next().getValues();
-      Object[] row = new Object[numColumns];
+      Object[] rowValues = new Object[numColumns];

Review Comment:
   (minor) Seems unrelated



##########
pinot-core/src/main/java/org/apache/pinot/core/query/reduce/GroupByDataTableReducer.java:
##########
@@ -136,23 +137,23 @@ private void reduceToResultTable(BrokerResponseNative 
brokerResponseNative, Data
     if (havingFilter != null) {
       HavingFilterHandler havingFilterHandler = new 
HavingFilterHandler(havingFilter, postAggregationHandler);
       while (rows.size() < limit && sortedIterator.hasNext()) {
-        Object[] row = sortedIterator.next().getValues();
-        extractFinalAggregationResults(row);
+        Object[] values = sortedIterator.next().getValues();

Review Comment:
   (minor) Suggest reverting it. I feel `row` is more specific, `values` is too 
generic



##########
pinot-core/src/main/java/org/apache/pinot/core/query/distinct/raw/BaseRawIntSingleColumnDistinctExecutor.java:
##########
@@ -29,35 +28,34 @@
 import org.apache.pinot.core.data.table.Record;
 import org.apache.pinot.core.query.distinct.DistinctExecutor;
 import org.apache.pinot.core.query.distinct.DistinctTable;
-import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.FieldSpec;
 
 
 /**
  * Base implementation of {@link DistinctExecutor} for single raw INT column.
  */
 abstract class BaseRawIntSingleColumnDistinctExecutor implements 
DistinctExecutor {
   final ExpressionContext _expression;
-  final DataType _dataType;
+  final FieldSpec _fieldSpec;
   final int _limit;
 
-  final IntSet _valueSet;
+  final ObjectSet<Integer> _valueSet;

Review Comment:
   This can cause performance regression. We should avoid adding overhead to 
existing use cases. If we want to return `null` as a separate distinct value, 
we can add a boolean flag `_hasNull`



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BigDecimalColumnPreIndexStatsCollector.java:
##########
@@ -112,8 +113,13 @@ public int getCardinality() {
   }
 
   @Override
-  public boolean hasNull() {
-    return false;
+  public boolean hasNulls() {

Review Comment:
   This API is not really used. We should rely on whether null value vector 
exist and non-empty to identify whether a column contains null values



##########
pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java:
##########
@@ -46,6 +49,7 @@
 public class DataSchema {
   private final String[] _columnNames;
   private final ColumnDataType[] _columnDataTypes;
+  private final FieldSpec[] _columnFieldSpecs;

Review Comment:
   Trying to understand why we need `FieldSpec` in the data schema. Seems the 
intention is to use it to calculate the default null value.
   It should not be stored here for the following reasons:
   1. We don't want to carry the field spec in the transport object
   2. The column could be an expression (transform function/aggregation 
function etc), and there is no existing `FieldSpec` for them
   3. We should not need to calculate default null value on the fly. Filling 
any value should be fine because we should skip processing them



##########
pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java:
##########
@@ -102,11 +110,17 @@ public DataTableBuilder(DataSchema dataSchema) {
   }
 
   public static DataTable getEmptyDataTable() {
-    return _version == VERSION_2 ? new DataTableImplV2() : new 
DataTableImplV3();
+    if (_version == VERSION_2) {
+      return new DataTableImplV2();
+    }
+    if (_version == VERSION_4) {

Review Comment:
   (minor) Suggest checking `VERSION_3` for readability, same for `build()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8742: Null handling in SELECT, ORDER BY, DISTINCT, HAVING, GROUP BY, and few agg. functions and few data types

Reply via email to