[GitHub] [spark] sunchao commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

GitBox Wed, 17 Nov 2021 12:55:01 -0800


sunchao commented on a change in pull request #34611:
URL: https://github.com/apache/spark/pull/34611#discussion_r751620303




##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -470,6 +493,14 @@ public final int appendBooleans(int count, boolean v) {
     return result;
   }
 
+  public final int appendBooleans(int count, int src, int offset) {

Review comment:
       nit: add some comments to this method, it is a bit confusing that the 
input is a int.

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -46,6 +47,7 @@
  * WritableColumnVector are intended to be reused.
  */
 public abstract class WritableColumnVector extends ColumnVector {
+  private byte[] byte8 = new byte[8];

Review comment:
       make this final? and can we make this a `ByteArray` so we don't have to 
create a new one each time `putBooleans` is called?

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##########
@@ -542,7 +549,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 
   override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
     runBenchmark("SQL Single Numeric Column Scan") {
-      Seq(ByteType, ShortType, IntegerType, LongType, FloatType, 
DoubleType).foreach {
+      Seq(BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, 
DoubleType).foreach {

Review comment:
       need to update benchmark results with the change

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
##########
@@ -130,6 +130,47 @@ class ColumnarBatchSuite extends SparkFunSuite {
       }
   }
 
+  testVector("[SPARK35867] Boolean APIs", 1024, BooleanType) {

Review comment:
       nit: don't need to put `[SPARK35867]` here.

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -201,6 +215,15 @@ public WritableColumnVector reserveDictionaryIds(int 
capacity) {
    */
   public abstract void putBooleans(int rowId, int count, boolean value);
 
+  /**
+   * Sets bits from [src[srcIndex], src[srcIndex + count]) to [rowId, rowId + 
count)
+   * src must be positive and contain 8 bits of bitmask in the lowest byte.
+   */
+  public void putBooleans(int rowId, int count, int src, int srcIndex) {
+    putBytes(rowId, count, 
ByteBuffer.wrap(byte8).putLong(toBitPerByte(src)).array(), srcIndex);
+  }
+  public void putBooleans(int rowId, int src) { putBooleans(rowId, 8, src, 0); 
}

Review comment:
       style nit: let's start a new line for the method body.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
##########
@@ -130,6 +130,47 @@ class ColumnarBatchSuite extends SparkFunSuite {
       }
   }
 
+  testVector("[SPARK35867] Boolean APIs", 1024, BooleanType) {
+    column => ()

Review comment:
       nit: `()` is unnecessary?

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -179,6 +180,18 @@ public WritableColumnVector reserveDictionaryIds(int 
capacity) {
    */
   protected abstract void reserveInternal(int capacity);
 
+  /**
+   * Each byte of the returned value (long) has one bit from `bits`. I.e. it 
is equivalent to
+   *    byte[] a = {(byte)(bits >> 0 & 1), (byte)(bits >> 1 & 1),
+   *                (byte)(bits >> 2 & 1), (byte)(bits >> 3 & 1),
+   *                (byte)(bits >> 4 & 1), (byte)(bits >> 5 & 1),
+   *                (byte)(bits >> 6 & 1), (byte)(bits >> 7 & 1)};
+   *    return ByteBuffer.wrap(a).getLong();
+   */
+  protected final long toBitPerByte(int bits) {
+    return ((bits * 0x8040201008040201L) >>> 7) & 0x101010101010101L;

Review comment:
       Thanks for the reference! This is not easy to understand I think. Could 
you add  a bit more explanation on how the multiplication/shift/mask works in 
the comments? A reference link on the approach would help too. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] sunchao commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

Reply via email to