[GitHub] [pinot] walterddr commented on a diff in pull request #9755: [multistage] leaf-node return data in non-serialized format

GitBox Wed, 09 Nov 2022 14:11:29 -0800


walterddr commented on code in PR #9755:
URL: https://github.com/apache/pinot/pull/9755#discussion_r1018462505



##########
pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/blocks/TransferableBlockUtils.java:
##########
@@ -59,37 +60,67 @@ public static boolean isNoOpBlock(TransferableBlock 
transferableBlock) {
   }
 
   /**
-   *  Split a block into multiple block so that each block size is within 
maxBlockSize.
-   *  Currently, we only support split for row type dataBlock.
-   *  For columnar data block, we return the original data block.
-   *  Metadata data block split is not supported.
+   * Split block into multiple blocks. Default without any clean up.
    *
-   *  When row size is greater than maxBlockSize, we pack each row as a 
separate block.
+   * @see TransferableBlockUtils#splitBlock(TransferableBlock, 
BaseDataBlock.Type, int, boolean)
    */
   public static List<TransferableBlock> splitBlock(TransferableBlock block, 
BaseDataBlock.Type type, int maxBlockSize) {
+    return splitBlock(block, type, maxBlockSize, false);
+  }
+
+  /**
+   *
+   *  Split a block into multiple block so that each block size is within 
maxBlockSize. Currently,
+   *  <ul>
+   *    <li>For row data block, we split for row type dataBlock.</li>
+   *    <li>For columnar data block, we return the original data block.</li>
+   *    <li>For metadata block, split is not supported.</li>
+   *  </ul>
+   *
+   * @param block the data block
+   * @param type type of block
+   * @param maxBlockSize Each chunk of data is less than maxBlockSize
+   * @param isCleanupRequired whether clean up is required, set to true if the 
block is constructed from leaf stage.
+   * @return a list of data block chunks
+   */
+  public static List<TransferableBlock> splitBlock(TransferableBlock block, 
BaseDataBlock.Type type, int maxBlockSize,
+      boolean isCleanupRequired) {
     List<TransferableBlock> blockChunks = new ArrayList<>();
     if (type != DataBlock.Type.ROW) {
       return Collections.singletonList(block);
     } else {
-      int rowSizeInBytes = ((RowDataBlock) 
block.getDataBlock()).getRowSizeInBytes();
-      int numRowsPerChunk = maxBlockSize / rowSizeInBytes;
+      int estimatedRowSizeInBytes = 
block.getDataSchema().getColumnNames().length * 8;
+      int numRowsPerChunk = maxBlockSize / estimatedRowSizeInBytes;
       Preconditions.checkState(numRowsPerChunk > 0, "row size too large for 
query engine to handle, abort!");
 
-      int totalNumRows = block.getNumRows();
-      List<Object[]> allRows = block.getContainer();
-      int currentRow = 0;
-      while (currentRow < totalNumRows) {
-        List<Object[]> chunk = allRows.subList(currentRow, Math.min(currentRow 
+ numRowsPerChunk, allRows.size()));
-        currentRow += numRowsPerChunk;
+      Collection<Object[]> allRows = block.getContainer();
+      DataSchema dataSchema = block.getDataSchema();
+      int rowId = 0;
+      List<Object[]> chunk = new ArrayList<>(numRowsPerChunk);
+      for (Object[] row : allRows) {
+        if (isCleanupRequired) {
+          chunk.add(cleanupRow(row, dataSchema));
+        } else {
+          chunk.add(row);
+        }
+        rowId++;
+        if (rowId % numRowsPerChunk == 0) {
+          blockChunks.add(new TransferableBlock(chunk, block.getDataSchema(), 
block.getType()));
+          chunk = new ArrayList<>();
+        }
+      }
+      if (chunk.size() > 0) {
         blockChunks.add(new TransferableBlock(chunk, block.getDataSchema(), 
block.getType()));
       }
     }
     return blockChunks;
   }
 
-  public static Object[] getRow(TransferableBlock transferableBlock, int 
rowId) {
-    Preconditions.checkState(transferableBlock.getType() == DataBlock.Type.ROW,
-        "TransferableBlockUtils doesn't support get row from non-ROW-based 
data block type yet!");
-    return transferableBlock.getContainer().get(rowId);
+  private static Object[] cleanupRow(Object[] row, DataSchema dataSchema) {

Review Comment:
   yeah. maybe `canonicalizeRow`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] walterddr commented on a diff in pull request #9755: [multistage] leaf-node return data in non-serialized format

Reply via email to