[GitHub] [pinot] gortiz commented on a diff in pull request #11453: Fix the data type handling in multi-stage engine

via GitHub Mon, 28 Aug 2023 23:36:29 -0700


gortiz commented on code in PR #11453:
URL: https://github.com/apache/pinot/pull/11453#discussion_r1308277764



##########
pinot-common/src/main/java/org/apache/pinot/common/datablock/DataBlockUtils.java:
##########
@@ -275,43 +264,91 @@ private static Object[] extractRowFromDataBlock(DataBlock dataBlock, int rowId,
    * TODO: Add support for COLUMNAR format.
    * @return int array of values in the column
    */
-  public static int[] extractIntValuesForColumn(DataBlock dataBlock, int columnIndex) {
+  public static int[] extractIntValuesForColumn(DataBlock dataBlock, int colId) {
     DataSchema dataSchema = dataBlock.getDataSchema();
-    DataSchema.ColumnDataType[] columnDataTypes = dataSchema.getColumnDataTypes();
-
-    // Get null bitmap for the column.
-    RoaringBitmap nullBitmap = extractNullBitmaps(dataBlock)[columnIndex];
+    ColumnDataType storedType = dataSchema.getColumnDataType(colId).getStoredType();
+    RoaringBitmap nullBitmap = dataBlock.getNullRowIds(colId);
     int numRows = dataBlock.getNumberOfRows();
-
-    int[] rows = new int[numRows];
-    for (int rowId = 0; rowId < numRows; rowId++) {
-      if (nullBitmap != null && nullBitmap.contains(rowId)) {
-        continue;
+    int[] values = new int[numRows];
+    if (nullBitmap == null) {
+      switch (storedType) {
+        case INT:
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            values[rowId] = dataBlock.getInt(rowId, colId);
+          }
+          break;
+        case LONG:
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            values[rowId] = (int) dataBlock.getLong(rowId, colId);
+          }
+          break;
+        case FLOAT:
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            values[rowId] = (int) dataBlock.getFloat(rowId, colId);
+          }
+          break;
+        case DOUBLE:
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            values[rowId] = (int) dataBlock.getDouble(rowId, colId);
+          }
+          break;
+        case BIG_DECIMAL:
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            values[rowId] = dataBlock.getBigDecimal(rowId, colId).intValue();
+          }
+          break;
+        default:
+          throw new IllegalStateException(String.format("Cannot extract int values for column: %s with stored type: %s",
+              dataSchema.getColumnName(colId), storedType));
       }
-
-      switch (columnDataTypes[columnIndex]) {
+    } else {
+      switch (storedType) {
         case INT:
-        case BOOLEAN:
-          rows[rowId] = dataBlock.getInt(rowId, columnIndex);
+          for (int rowId = 0; rowId < numRows; rowId++) {
+            if (nullBitmap.contains(rowId)) {
+              continue;
+            }
+            values[rowId] = dataBlock.getInt(rowId, colId);

Review Comment:
   nit: I know performance is not our priority here, and the boxing we are doing 
is a bigger performance issue than what I'm about to suggest, but:
   
   Depending on `numRows`, it may be faster to copy all values as in the 
non-nullable case and then run a second loop that resets only the null rows. 
We can also ask `nullBitmap` whether all row IDs from `0` to `numRows` are 
null. That should be a very fast operation in roaring bitmaps, and when it 
holds we can skip the whole loop.
   
   Anyway, this is one of the places where we generate the most garbage in V2. 
We really need to refactor this code in the medium term.
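
   A rough sketch of what I mean, not a drop-in patch. To keep it 
self-contained it uses `java.util.BitSet` as a stand-in for `RoaringBitmap` 
and an `IntUnaryOperator` as a stand-in for `dataBlock.getInt(rowId, colId)`; 
the class and method names are made up for illustration. With a real 
`RoaringBitmap`, the all-null check would presumably be a range query such as 
`contains(0, numRows)` instead of `nextClearBit`.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

// Hypothetical helper; BitSet and IntUnaryOperator are stand-ins (assumptions)
// for RoaringBitmap and DataBlock.getInt(rowId, colId).
final class NullAwareCopy {
  private NullAwareCopy() {
  }

  static int[] extractInts(IntUnaryOperator reader, BitSet nullRows, int numRows) {
    int[] values = new int[numRows];

    // Fast path: every row in [0, numRows) is null, so no value is read at all.
    if (nullRows.nextClearBit(0) >= numRows) {
      return values;
    }

    // Pass 1: copy every row without a per-row null check,
    // exactly like the non-nullable branch.
    for (int rowId = 0; rowId < numRows; rowId++) {
      values[rowId] = reader.applyAsInt(rowId);
    }

    // Pass 2: visit only the null rows and reset them to the default value.
    for (int rowId = nullRows.nextSetBit(0); rowId >= 0 && rowId < numRows;
        rowId = nullRows.nextSetBit(rowId + 1)) {
      values[rowId] = 0;
    }
    return values;
  }

  public static void main(String[] args) {
    BitSet nullRows = new BitSet();
    nullRows.set(1);
    nullRows.set(3);
    int[] values = extractInts(rowId -> (rowId + 1) * 10, nullRows, 5);
    System.out.println(Arrays.toString(values)); // prints [10, 0, 30, 0, 50]
  }
}
```

   Whether two passes actually beat the single loop with `contains(rowId)` 
depends on `numRows` and on how dense the null bitmap is, so this would need 
a benchmark before changing anything.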



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

