sunchao commented on a change in pull request #34611:
URL: https://github.com/apache/spark/pull/34611#discussion_r751620303
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -470,6 +493,14 @@ public final int appendBooleans(int count, boolean v) {
return result;
}
+ public final int appendBooleans(int count, int src, int offset) {
Review comment:
nit: add some comments to this method, it is a bit confusing that the
input is a int.
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -46,6 +47,7 @@
* WritableColumnVector are intended to be reused.
*/
public abstract class WritableColumnVector extends ColumnVector {
+ private byte[] byte8 = new byte[8];
Review comment:
make this final? and can we make this a `ByteArray` so we don't have to
create a new one each time `putBooleans` is called?
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##########
@@ -542,7 +549,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
runBenchmark("SQL Single Numeric Column Scan") {
- Seq(ByteType, ShortType, IntegerType, LongType, FloatType,
DoubleType).foreach {
+ Seq(BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType,
DoubleType).foreach {
Review comment:
need to update benchmark results with the change
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
##########
@@ -130,6 +130,47 @@ class ColumnarBatchSuite extends SparkFunSuite {
}
}
+ testVector("[SPARK35867] Boolean APIs", 1024, BooleanType) {
Review comment:
nit: don't need to put `[SPARK35867]` here.
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -201,6 +215,15 @@ public WritableColumnVector reserveDictionaryIds(int
capacity) {
*/
public abstract void putBooleans(int rowId, int count, boolean value);
+ /**
+ * Sets bits from [src[srcIndex], src[srcIndex + count]) to [rowId, rowId +
count)
+ * src must be positive and contain 8 bits of bitmask in the lowest byte.
+ */
+ public void putBooleans(int rowId, int count, int src, int srcIndex) {
+ putBytes(rowId, count,
ByteBuffer.wrap(byte8).putLong(toBitPerByte(src)).array(), srcIndex);
+ }
+ public void putBooleans(int rowId, int src) { putBooleans(rowId, 8, src, 0);
}
Review comment:
style nit: let's start a new line for the method body.
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
##########
@@ -130,6 +130,47 @@ class ColumnarBatchSuite extends SparkFunSuite {
}
}
+ testVector("[SPARK35867] Boolean APIs", 1024, BooleanType) {
+ column => ()
Review comment:
nit: `()` is unnecessary?
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##########
@@ -179,6 +180,18 @@ public WritableColumnVector reserveDictionaryIds(int
capacity) {
*/
protected abstract void reserveInternal(int capacity);
+ /**
+ * Each byte of the returned value (long) has one bit from `bits`. I.e. it
is equivalent to
+ * byte[] a = {(byte)(bits >> 0 & 1), (byte)(bits >> 1 & 1),
+ * (byte)(bits >> 2 & 1), (byte)(bits >> 3 & 1),
+ * (byte)(bits >> 4 & 1), (byte)(bits >> 5 & 1),
+ * (byte)(bits >> 6 & 1), (byte)(bits >> 7 & 1)};
+ * return ByteBuffer.wrap(a).getLong();
+ */
+ protected final long toBitPerByte(int bits) {
+ return ((bits * 0x8040201008040201L) >>> 7) & 0x101010101010101L;
Review comment:
Thanks for the reference! This is not easy to understand I think. Could
you add a bit more explanation on how the multiplication/shift/mask works in
the comments? A reference link on the approach would help too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]