RussellSpitzer commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r737694596



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java
##########
@@ -47,18 +68,71 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
       closeVectors();
     }
 
+    Pair<int[], Integer> rowIdMapping = rowIdMapping(numRowsToRead);
+
     for (int i = 0; i < readers.length; i += 1) {
       vectorHolders[i] = readers[i].read(vectorHolders[i], numRowsToRead);
       int numRowsInVector = vectorHolders[i].numValues();
       Preconditions.checkState(
           numRowsInVector == numRowsToRead,
           "Number of rows in the vector %s didn't match expected %s ", numRowsInVector,
           numRowsToRead);
-      arrowColumnVectors[i] =
-          IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
+
+      if (rowIdMapping == null) {
+        arrowColumnVectors[i] = IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
+      } else {
+        int[] rowIdMap = rowIdMapping.first();
+        Integer numRows = rowIdMapping.second();
+        arrowColumnVectors[i] = ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMap, numRows);
+      }
     }
+
+    rowStartPosInBatch += numRowsToRead;
     ColumnarBatch batch = new ColumnarBatch(arrowColumnVectors);
-    batch.setNumRows(numRowsToRead);
+
+    if (rowIdMapping == null) {
+      batch.setNumRows(numRowsToRead);
+    } else {
+      Integer numRows = rowIdMapping.second();
+      batch.setNumRows(numRows);
+    }
     return batch;
   }
+
+  private Pair<int[], Integer> rowIdMapping(int numRows) {
+    if (deletes != null && deletes.hasPosDeletes()) {
+      return buildRowIdMapping(deletes.deletedRowPositions(), numRows);
+    } else {
+      return null;
+    }
+  }
+
+  /**
+   * Build a row id mapping inside a batch, which skips delete rows. For example, if the 1st and 3rd rows are deleted in
+   * a batch with 5 rows, the mapping would be {0->1, 1->3, 2->4}, and the new num of rows is 3.
+   * @param deletedRowPositions a set of deleted row positions
+   * @param numRows the num of rows
+   * @return the mapping array and the new num of rows in a batch, null if no row is deleted
+   */
+  private Pair<int[], Integer> buildRowIdMapping(Roaring64Bitmap deletedRowPositions, int numRows) {
Review comment:
   I'm a bit worried about this setup: it seems like for every batch we are saving an array which could end up being much larger than the data in question. This may be worth checking out later, but imagine this use case:
   
   Say I have a query where I prune away all but a single integer column. That column is uniform across all rows, so the in-memory column vector representation is essentially nil, but a single record is deleted. My in-memory batch looks like:
   ```
   rowIdMap = # of Rows * Size of Integer
   columnData = Size of ~4 Integers? (I don't know exactly how it stores this, but I assume with the format it's something like number of entries + entry + some other metadata)
   ```
   
   So we end up using a very large amount of memory relative to the amount of 
memory required for the actual data.
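   
   Just to put rough numbers on that, here's a back-of-envelope sketch. It's purely illustrative: the 5,000-row batch size and the byte counts are my assumptions, not measurements from this PR.
   
   ```java
   // Back-of-envelope sketch only; the numbers are assumptions, not measurements.
   public class MappingOverheadEstimate {
     public static void main(String[] args) {
       int rowsPerBatch = 5_000;                                 // assumed vectorized batch size
       long rowIdMapBytes = (long) rowsPerBatch * Integer.BYTES; // dense int[] mapping, ~20 KB per batch
   
       // A dictionary/RLE-style constant int column might amount to a handful of
       // values plus run metadata, call it on the order of a few dozen bytes.
       long constantColumnBytes = 4L * Integer.BYTES;
   
       System.out.printf("rowIdMap ~%d bytes vs column data ~%d bytes%n",
           rowIdMapBytes, constantColumnBytes);
     }
   }
   ```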
   
   This feels to me like maybe we should be saving some kind of tree structure of offsets rather than a 1-to-1 mapping. You would traverse the tree to find the node for the last delete run at or before the row you want, and then add that cumulative offset to the index being looked up.
   
   So for example, given row ids
   ```
   [0,1,2,3,4,5,6,7,8,9, ...]
   ```
   With deletes
   ```
   [1, 5, 6, 8]
   ```
   First we create a tree of offsets by going through the deletes, grouping any contiguous deletes:
   ```
   [1, 1]
   [5, 3] // rows 5 and 6 are both deleted, so any request for rows after this run needs to be offset by 2 additional rows, 3 in total
   [8, 4]
   ```
   Which becomes
   ```
       [5, 3]
       /     \
   [1, 1]  [8, 4]
   ```
   So if I want to get row 10 in the data set, I return an offset of 4, so I know I just add [10 + 4].
   
   This would make lookup a bit slower (depending on the data structure 
implementation details) but should be much smaller and would require a lot less 
array twiddling. 
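   
   A minimal sketch of what that lookup could look like, assuming a `TreeMap` keyed by the start of each contiguous delete run (class and method names here are made up for illustration, none of this is from the PR). One subtlety: adding the offset can land you inside a later delete run, so the sketch re-applies the lookup until the offset stops changing.
   ```java
   import java.util.Map;
   import java.util.TreeMap;
   
   public class DeleteOffsetIndex {
     // key = physical position where a contiguous run of deletes starts,
     // value = cumulative number of deleted rows up to and including that run,
     // e.g. deletes [1, 5, 6, 8] -> {1=1, 5=3, 8=4}, as in the example above
     private final TreeMap<Integer, Integer> offsets = new TreeMap<>();
   
     public DeleteOffsetIndex(int[] sortedDeletes) {
       int cumulative = 0;
       int runStart = -1;
       for (int i = 0; i < sortedDeletes.length; i++) {
         if (runStart < 0) {
           runStart = sortedDeletes[i];
         }
         cumulative++;
         boolean runEnds = i == sortedDeletes.length - 1 || sortedDeletes[i + 1] != sortedDeletes[i] + 1;
         if (runEnds) {
           offsets.put(runStart, cumulative);
           runStart = -1;
         }
       }
     }
   
     // Maps a logical row id (with deletes skipped) back to a physical row id in the batch.
     public int toPhysical(int logicalRowId) {
       int offset = 0;
       while (true) {
         Map.Entry<Integer, Integer> entry = offsets.floorEntry(logicalRowId + offset);
         int next = entry == null ? 0 : entry.getValue();
         if (next == offset) {
           return logicalRowId + offset;
         }
         offset = next;
       }
     }
   
     public static void main(String[] args) {
       DeleteOffsetIndex index = new DeleteOffsetIndex(new int[] {1, 5, 6, 8});
       System.out.println(index.toPhysical(10)); // prints 14, i.e. 10 + 4
     }
   }
   ```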
   
   That said, this isn't a deal breaker for me; I think we should definitely have good benchmarks and correctness checks in place before we start messing around here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



